In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")

# Feature Selection

The element that has the biggest impact in the quality of your model is data features. You can only include in your model the attributes that you have and if they are not relevant, partially relevant or don't caputre the causality relationships behind the model, or introduce other relationships that correspond to other causes different from the ones that you want to investigate, then you'll have a poor model. 

Selecting the relevant features that add to your model is therefore of the utmost importance. 

In this notebook we will deal with four approaches:

        1) Univaritate Selection.
        2) Recursive feature elimination.
        3) PCA - Principal Component Analysis.
        4) Estimating feature importance.

Feature selection is a process where you select those features in your data that contribute most to the variable of interest. Irrelevant features decrease the accuracy of many models because you try to adjust on noise, this is particularly important in the case of linear models, such as linear and logistic regressions, where all features are always taken into accout. Three are the main benefits of feature selection:

        1) Reduces overfitting. Less redundant data implies less decisions made on noise. 
        2) Improves accuracy. Less misleading data results in a more accurate model. 
        3) Reduces training time. Less data implies faster training. 
        
Scikitlearn has a nice and short article on feature selection where you can learn more https://scikit-learn.org/stable/modules/feature_selection.html

Again we will use the Pima Indians onset of diabetes dataset. 


<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional Machine Learning dataset, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The datasets consists of several medical predictor variables and one target variable, <b>Outcome</b>. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction (scores de likelihood of diabetes based on family history)</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

In [2]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]
X
pd.DataFrame(X).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

Unnamed: 0,0,1,2,3,4,5,6,7
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0


<h1>Univariate Selection </h1>

One approach is to use statistical tests for example the Pearson Chi-Squared $\chi^2$ is commonly used to select the most significant features. 

We will use the <b> SelectKBest </b> class in scikit-learn.


In [3]:
# Univariate selection using Chi-squared 
set_printoptions(precision=3)
p_indians.head()

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 

# feature selection (we select the 4 best)
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)
print("Scores")

fit.scores_

print("The 4 attributes with the highest scores are: glucose, insulin, bmi and age ")
print()

features=fit.transform(X)
features[0:5,:]

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Scores


array([ 111.52 , 1411.887,   17.605,   53.108, 2175.565,  127.669,
          5.393,  181.304])

The 4 attributes with the highest scores are: glucose, insulin, bmi and age 



array([[148. ,   0. ,  33.6,  50. ],
       [ 85. ,   0. ,  26.6,  31. ],
       [183. ,   0. ,  23.3,  32. ],
       [ 89. ,  94. ,  28.1,  21. ],
       [137. , 168. ,  43.1,  33. ]])

<h1>Recursive Feature Elimination</h1>

This is a very intuitive approach. It consist on recursively removing attributes and building a model with those atrributes remaining. It uses the model accuracy to identify which atrributes or combination of attributes contribute the most. 

We will use it with a logistic regression, but the choice of algorithm doesn't matter too much as long as your are consistent. 

Recursive Feature Elimination uses the <b>RFE </b> class. 


In [4]:
# Recursive Feature Elimiantion

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

p_indians.head()

#Logistic regression
model = LogisticRegression(solver='liblinear')

rfe = RFE(model, 3) # we want to find the 3 top features
fit = rfe.fit(X, Y)

print(f'Number of features {fit.n_features_:d}')
print(f'Selected features {fit.support_}')
print(f'Ranking of features {fit.ranking_}')
print()
print("Top features seem to be pregnancies, bmi, and pedi(Diabetes Pedigree Function)")

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Number of features 3
Selected features [ True False False False False  True  True False]
Ranking of features [1 2 3 5 6 1 1 4]

Top features seem to be pregnancies, bmi, and pedi(Diabetes Pedigree Function)


<b><font color="red" size=6>Mission 1</font>

For this and the next mission we will use data from Kaggle In concrete from the World University Rankings Competition https://www.kaggle.com/mylesoneill/world-university-rankings

a) Using the Shanghai rankings find the top 3 most important features to explain them with both univariate and recursive (in recursive because we are using log regression create an output variable of being in the top 50 or not).
<br><br>
b) Same for the Times ranking. 
<br><br>
c) Does it change if we choose the top 10 or top 100?
</b>

In [26]:
#get the packages we ne
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Binarizer

In [27]:
#a) 
#read dataset
sh = pd.read_csv("shanghaiData.csv")
sh.head() # peak at the data

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year
0,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,72.4,2005
1,2,University of Cambridge,1,73.6,99.8,93.4,53.3,56.6,70.9,66.9,2005
2,3,Stanford University,2,73.4,41.1,72.2,88.5,70.9,72.3,65.0,2005
3,4,"University of California, Berkeley",3,72.8,71.8,76.0,69.4,73.9,72.2,52.7,2005
4,5,Massachusetts Institute of Technology (MIT),4,70.1,74.0,80.6,66.7,65.8,64.3,53.0,2005


In [28]:
#clean Dataset
sh=sh[sh['year']==2015]
sh['world_rank'] = sh['world_rank'].str.replace('-', '').astype(int)

n50=50
def top50(a):
    if a['world_rank']<=n50:
        return 1
    else:
        return 0

sh['top50']= sh.apply(top50,axis=1)
sh['total_score'].fillna(0,inplace=True)
sh['ns'].fillna(0,inplace=True)
sh = sh.dropna()

#separate the dataset into input and output
array_SH=sh.values
independentsh=array_SH[:,4:11]
dependentsh=array_SH[:,11:]
dependentsh = dependentsh.astype("int")
pd.DataFrame(independentsh).head()

Unnamed: 0,0,1,2,3,4,5,6
0,100.0,100.0,100.0,100.0,100.0,76.6,2015
1,40.7,89.6,80.1,70.1,70.6,53.8,2015
2,68.2,80.7,60.6,73.1,61.1,68.0,2015
3,65.1,79.4,66.1,65.6,67.9,56.5,2015
4,77.1,96.6,50.8,55.6,66.4,55.8,2015


In [30]:
#select the 3 best features using Univariate Selection
chisquaresh = SelectKBest(score_func=chi2, k=4)
bestfitsh = chisquaresh.fit(independentsh, dependentsh)

bestfitsh.scores_
print("TOP 3 FEATURES: alumni, award and hici")

array([3525.677, 8019.428, 2872.417, 2572.629,  604.408,  523.418,
          0.   ])

TOP 3 FEATURES: alumni, award and hici


In [31]:
# select the 3 best features using Recursive Feature Elimination

# establish the linear logistic relationships of the variables
modelsh = LogisticRegression(solver='liblinear')

# get the most important 3 features using recursive feature elimination
rfesh = RFE(modelsh, 3)
fitsh = rfesh.fit(independentsh, np.ravel(dependentsh))

#print the output of the Recursive Feature Elimination
print(f'Selected features {fitsh.support_}')
print(f'Ranking of features {fitsh.ranking_}')
print("TOP 3 FEATURES: award, ns and pcp")

Selected features [False  True False  True False  True False]
Ranking of features [4 1 2 1 3 1 5]
TOP 3 FEATURES: award, ns and pcp


In [89]:
trial= pd.read_csv("timesData.csv")
trial["num_students"] = trial["num_students"].str.replace(',','')
trial["international_students"] = trial["international_students"].str.replace('%',' ')
trial = trial[:199]
trial.tail()
# trial = trial.apply(pd.to_numeric)
# trial["international_students"] = trial["international_students"].astype(int)
# trial["international_students"] = trial["international_students"]/100



Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
194,195,University of Vienna,Austria,47.6,63.2,45.7,45.6,27.0,46.7,34651,20.5,25,67 : 33,2011
195,196,Kent State University,United States of America,33.5,15.9,33.3,76.8,26.3,46.5,23122,19.0,8,58 : 42,2011
196,197,University of Illinois at Chicago,United States of America,57.8,51.8,46.8,34.7,-,46.4,24313,9.2,17,53 : 47,2011
197,197,Zhejiang University,China,54.6,29.6,41.3,44.3,70.3,46.4,47508,15.9,5,41 : 59,2011
198,199,Simon Fraser University,Canada,32.9,51.9,44.2,60.2,37.9,46.2,26640,28.3,19,55 : 45,2011


In [54]:
#b)
#get the FT dataset from Kaggle and set the first row as headers
df = pd.read_csv("timesData.csv")
df["num_students"] = df["num_students"].str.replace(',','')
df.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596,7.8,22%,42 : 58,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929,8.4,27%,45 : 55,2011


In [32]:
#clean dataset
df["num_students"] = df["num_students"].str.replace(',','')
df["world_rank"] = df["world_rank"].str.replace('=','')
df["international_students"] = df["international_students"].str.replace('%','')
df["female_male_ratio"] = df["female_male_ratio"].str[:2]

n=50
df=df[df['year']==2015]
df['world_rank'] = df['world_rank'].str.replace('-', '').astype(int)
df = df.drop(columns="country")
df = df.drop(columns="university_name")
df = df.replace({'-': ''}, regex=True)
df = df.drop(columns="total_score")

n50=50
def top(a):
    if a['world_rank']<=n50:
        return 1
    else:
        return 0

df['top50']= df.apply(top50,axis=1)

df = df.apply(pd.to_numeric)
df = df.dropna()
df.head(10)


#separate the dataset into input and output
array_FT=df.values
independent=array_FT[:,1:10]
dependent=array_FT[:,11]
independent
pd.DataFrame(independent).head()
pd.DataFrame(dependent).head()

Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,top50
1402,1,92.2,67.0,98.1,99.7,89.1,2243.0,6.9,27.0,33.0,2015,1
1404,3,88.6,90.7,97.7,95.5,72.9,19919.0,11.6,34.0,46.0,2015,1
1405,4,91.5,69.0,96.7,99.1,63.1,15596.0,7.8,22.0,42.0,2015,1
1406,5,89.7,87.8,95.6,95.2,51.1,18812.0,11.8,34.0,46.0,2015,1
1407,6,89.1,84.3,88.2,100.0,95.7,11074.0,9.0,33.0,37.0,2015,1
1408,7,86.6,61.2,94.7,99.6,82.7,7929.0,8.4,27.0,45.0,2015,1
1409,8,84.2,58.5,96.7,99.1,44.8,36186.0,16.4,15.0,50.0,2015,1
1410,9,84.6,92.7,88.3,89.4,72.7,15060.0,11.7,51.0,37.0,2015,1
1411,9,88.5,59.8,90.8,94.0,42.0,11751.0,4.4,20.0,50.0,2015,1
1412,11,83.9,65.2,89.9,97.3,36.8,14221.0,6.9,21.0,42.0,2015,1


array([[92.2, 67. , 98.1, ...,  6.9, 27. , 33. ],
       [88.6, 90.7, 97.7, ..., 11.6, 34. , 46. ],
       [91.5, 69. , 96.7, ...,  7.8, 22. , 42. ],
       ...,
       [28.5, 36. , 27.5, ..., 18.3,  7. , 50. ],
       [17.8, 50.1, 22.4, ..., 32.2,  9. , 56. ],
       [16.2, 21.4,  8. , ..., 14.9,  2. , 31. ]])

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,92.2,67.0,98.1,99.7,89.1,2243.0,6.9,27.0,33.0
1,88.6,90.7,97.7,95.5,72.9,19919.0,11.6,34.0,46.0
2,91.5,69.0,96.7,99.1,63.1,15596.0,7.8,22.0,42.0
3,89.7,87.8,95.6,95.2,51.1,18812.0,11.8,34.0,46.0
4,89.1,84.3,88.2,100.0,95.7,11074.0,9.0,33.0,37.0


Unnamed: 0,0
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


In [10]:
#select the 3 best features using Univariate Selection
chisquare = SelectKBest(score_func=chi2, k=4)
bestfit = test.fit(independent, dependent)

bestfit.scores_
print("TOP 3 FEATURES: teaching, research and number of students")

array([1.317e+03, 6.026e+01, 2.087e+03, 2.645e+02, 8.286e+01, 2.449e+03,
       1.109e+02, 1.872e+02, 1.605e+00])

TOP 3 FEATURES: teaching, research and number of students


In [11]:
#select the 3 best features using Recursive Feature Elimination

#establish the linear logistic relationships of the variables
model2 = LogisticRegression(solver='liblinear')


#get the most important 3 features using recursive feature elimination
rfe2 = RFE(model2, 3)
fit2 = rfe2.fit(independent, dependent)

In [12]:
#print the output of the Recursive Feature Elimination
print(f'Selected features {fit2.support_}')
print(f'Ranking of features {fit2.ranking_}')
print("THE MOST IMPORTANT THREE ARE: teaching, female ratio and student staff ratio ")

Selected features [ True False  True False False False  True False False]
Ranking of features [1 3 1 6 4 7 1 2 5]
THE MOST IMPORTANT THREE ARE: teaching, female ratio and student staff ratio 


In [13]:
#c) SHANGHAI UNIVARIATE SELECTIONS FOR TOP 100 and TOP 10!!!!!!!!!!!

############  SHANGHAI TOP 10 and TOP 100 ##################
n10=10
def top10(a):
    if a['world_rank']<=n10:
        return 1
    else:
        return 0

sh['top10']= sh.apply(top10,axis=1)

n100=100
def top100(a):
    if a['world_rank']<=n100:
        return 1
    else:
        return 0

sh['top100']= sh.apply(top100,axis=1)

sh.head(12)

#separate the dataset into input and output
array_SH=sh.values

#independent variable selection
independentsh=array_SH[:,4:11]

#depedent variable selection
dependentsh10=array_SH[:,12:13].astype("int")

dependentsh100=array_SH[:,13:].astype("int")

###################### UNIVARIATE TOP10 and TOP100

#select the 3 best features using Univariate Selection
chisquaresh10 = SelectKBest(score_func=chi2, k=4)
bestfitsh10 = test.fit(independentsh, dependentsh10)

chisquaresh100 = SelectKBest(score_func=chi2, k=4)
bestfitsh100 = test.fit(independentsh, dependentsh100)

bestfitsh10.scores_
print("TOP 3 FEATURES for TOP10: alumni, award and hici")
bestfitsh100.scores_
print("TOP 3 Features for TOP100: alumni, hici and pub")

###################### RECURSIVE FEATURE SELECTION TOP10 and TOP100

#establish the linear logistic relationships of the variables
model3 = LogisticRegression(solver='liblinear')
model4 = LogisticRegression(solver='liblinear')

#get the most important 3 features using recursive feature elimination
rfe3 = RFE(model3, 3)
fit3 = rfe3.fit(independentsh, dependentsh10)

rfe4 = RFE(model4, 3)
fit4 = rfe4.fit(independentsh, dependentsh100)


print(f'Selected features {fit3.support_}')
print(f'Ranking of features {fit3.ranking_}')
print("THE MOST IMPORTANT THREE FOR TOP 10: alumni, pub and hici")

print(f'Selected features {fit4.support_}')
print(f'Ranking of features {fit4.ranking_}')
print("THE MOST IMPORTANT THREE FOR TOP100: alumni, award and hici")




Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year,top50,top10,top100
4397,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,76.6,2015,1,1,1
4398,2,Stanford University,2,73.3,40.7,89.6,80.1,70.1,70.6,53.8,2015,1,1,1
4399,3,Massachusetts Institute of Technology (MIT),3,70.4,68.2,80.7,60.6,73.1,61.1,68.0,2015,1,1,1
4400,4,"University of California, Berkeley",4,69.6,65.1,79.4,66.1,65.6,67.9,56.5,2015,1,1,1
4401,5,University of Cambridge,1,68.8,77.1,96.6,50.8,55.6,66.4,55.8,2015,1,1,1
4402,6,Princeton University,5,61.0,53.3,93.4,57.1,43.0,42.4,70.3,2015,1,1,1
4403,7,California Institute of Technology,6,59.6,49.5,66.7,49.3,56.4,44.0,100.0,2015,1,1,1
4404,8,Columbia University,7,58.8,63.5,65.9,52.1,51.9,68.8,33.2,2015,1,1,1
4405,9,University of Chicago,8,57.1,59.8,86.3,49.0,42.9,49.8,42.0,2015,1,1,1
4406,10,University of Oxford,2,56.6,49.7,54.9,52.3,51.9,70.9,43.1,2015,1,1,1


array([3578.964, 7137.328, 2946.952, 2352.637,  673.253,  570.264,
          0.   ])

TOP 3 FEATURES for TOP10: alumni, award and hici


array([3578.964, 7137.328, 2946.952, 2352.637,  673.253,  570.264,
          0.   ])

TOP 3 Features for TOP100: alumni, hici and pub
Selected features [ True False  True False  True False False]
Ranking of features [1 4 1 3 1 2 5]
THE MOST IMPORTANT THREE FOR TOP 10: alumni, pub and hici
Selected features [ True  True  True False False False False]
Ranking of features [1 1 1 2 4 3 5]
THE MOST IMPORTANT THREE FOR TOP100: alumni, award and hici


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


In [14]:
#c) FINANCIAL TIMES UNIVARIATE SELECTIONS FOR TOP 100 and TOP 10!!!!!!!!!!!

############  TIMES TOP 10 and TOP100 ##################

df['top10']= df.apply(top10, axis=1)
df["top100"]= df.apply(top100, axis=1)
df.head(12)

#separate the dataset into input and output
array_FT=df.values
independent=array_FT[:,1:10]

dependent10=array_FT[:,12:13].astype("int")
dependent100 = array_FT[:,13:].astype("int")

#select the 3 best features using Univariate Selection
chisquare10 = SelectKBest(score_func=chi2, k=4)
bestfit10 = test.fit(independent, dependent10)

chisquare100 = SelectKBest(score_func=chi2, k=4)
bestfit100 = test.fit(independent, dependent100)


###################### UNIVARIATE TOP10 and TOP100
bestfit10.scores_
print("TOP 3 FEATURES: teaching, research and number of students")

bestfit100.scores_
print("TOP 3 FEATURES: teaching, research and number of students")


###################### RECURSIVE FEATURE SELECTION TOP10 AND TOP100

#establish the linear logistic relationships of the variables
model5 = LogisticRegression(solver='liblinear')
model6 = LogisticRegression(solver="liblinear")

#get the most important 3 features using recursive feature elimination
rfe5 = RFE(model5, 3)
fit5 = rfe5.fit(independent, dependent10)
rfe6 = RFE(model6, 3)
fit6 = rfe6.fit(independent, dependent100)

print(f'Selected features {fit5.support_}')
print(f'Ranking of features {fit5.ranking_}')
print("THE MOST IMPORTANT THREE ARE: research, student_staff_ration and female_male_ratio")

print(f'Selected features {fit6.support_}')
print(f'Ranking of features {fit6.ranking_}')
print("THE MOST IMPORTANT THREE ARE: research, student_staff_ratio and female_male_ratio")

Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,top50,top10,top100
1402,1,92.2,67.0,98.1,99.7,89.1,2243.0,6.9,27.0,33.0,2015,1,1,1
1404,3,88.6,90.7,97.7,95.5,72.9,19919.0,11.6,34.0,46.0,2015,1,1,1
1405,4,91.5,69.0,96.7,99.1,63.1,15596.0,7.8,22.0,42.0,2015,1,1,1
1406,5,89.7,87.8,95.6,95.2,51.1,18812.0,11.8,34.0,46.0,2015,1,1,1
1407,6,89.1,84.3,88.2,100.0,95.7,11074.0,9.0,33.0,37.0,2015,1,1,1
1408,7,86.6,61.2,94.7,99.6,82.7,7929.0,8.4,27.0,45.0,2015,1,1,1
1409,8,84.2,58.5,96.7,99.1,44.8,36186.0,16.4,15.0,50.0,2015,1,1,1
1410,9,84.6,92.7,88.3,89.4,72.7,15060.0,11.7,51.0,37.0,2015,1,1,1
1411,9,88.5,59.8,90.8,94.0,42.0,11751.0,4.4,20.0,50.0,2015,1,1,1
1412,11,83.9,65.2,89.9,97.3,36.8,14221.0,6.9,21.0,42.0,2015,1,0,1


array([1.253e+03, 6.636e+01, 2.085e+03, 3.992e+02, 1.726e+02, 1.286e+04,
       9.806e+01, 1.197e+02, 1.366e+00])

TOP 3 FEATURES: teaching, research and number of students


array([1.253e+03, 6.636e+01, 2.085e+03, 3.992e+02, 1.726e+02, 1.286e+04,
       9.806e+01, 1.197e+02, 1.366e+00])

TOP 3 FEATURES: teaching, research and number of students
Selected features [False False  True False False False  True False  True]
Ranking of features [3 6 1 2 4 7 1 5 1]
THE MOST IMPORTANT THREE ARE: research, student_staff_ration and female_male_ratio
Selected features [False False  True False False False  True False  True]
Ranking of features [6 5 1 2 3 7 1 4 1]
THE MOST IMPORTANT THREE ARE: research, student_staff_ratio and female_male_ratio


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


<h1>Principal Component Analysis</h1>

Principal Component Analysis is a data reduction technique using linear algebra. The idea here is to "compress" several dimensions into pricipal components. 

One problem of PCA is the explainability. Once you compressed the attributes into principal components you can no longer to refer them individually establishing causality links or relationships. 

A property of PCA is that you can choose the number of dimensions or principal components. In our example we will select 3 principal components. 

For Principal Component Analysis you use the <b>PCA</b> class. 


In [15]:
from sklearn.decomposition import PCA

p_indians.head()

#PCA
pca = PCA(n_components=3)
pca_fit = pca.fit(X)

print(f"Explained variance: {pca_fit.explained_variance_ratio_}")
print()

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print("Principal Components have little resemblance to the source data attributes")
print()
print(pca_fit.components_)

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Explained variance: [0.889 0.062 0.026]

Principal Components have little resemblance to the source data attributes

[[-0.002  0.098  0.016  0.061  0.993  0.014  0.001 -0.004]
 [-0.023 -0.972 -0.142  0.058  0.095 -0.047 -0.001 -0.140]
 [-0.022  0.143 -0.922 -0.307  0.021 -0.132 -0.001 -0.125]]


<h1>Feature Importance </h1>

One of the added features of tree based algorithms is that they can be used to estimate the importance of each feature and use it to refine the model to different levels depending on where we want to situate ourselves in the tension between explainability and accuracy. 

In this example we are going to use the ExtraTreesClassifier, but the technique is commonly used in all tree algoritms. 

For this example of assessing feature importance with trees we will use the <b>ExtraTreesClassifier</b> class. 


In [16]:
from sklearn.ensemble import ExtraTreesClassifier

p_indians.head()

model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)

print(model.feature_importances_)


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.110  0.236  0.102  0.080  0.073  0.142  0.121  0.136]


<b><font color="red" size=6>Mission 2</font>

a) Using the Shangai Data find the top attributes with a tree classifier for top-10, top-50 and top-100.  
<br>
b) Same for the Times ranking. 
<br><br>

</b>

In [17]:
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier


#REUSE THE DATASET WITH TOP10, TOP50 AND TOP100
sh.head(12)

#REUSE THE PREVIOUS DEPENDENT AND INDEPENDENT VARIABLE SELECTION FROM "array_SH"
array_SH

X = array_SH[:,4:10]

Y10 = array_SH[:,12:13].astype("int")
Y50 = array_SH[:,11].astype("int")
Y100 = array_SH[:,13].astype("int")

model10 = ExtraTreesClassifier(n_estimators=100)
model50 = ExtraTreesClassifier(n_estimators=100)
model100 = ExtraTreesClassifier(n_estimators=100)


print("############# Model Top 10 Results ################")
model10.fit(X,np.ravel(Y10))
print(model10.feature_importances_)
print("the top attributes are: award, hici and ns ")


print("############# Model Top 50 Results ################")
model50.fit(X,np.ravel(Y50))
print(model50.feature_importances_)
print("the top attributes are: award, hici and pub")


print("############# Model Top 100 Results ################")
model100.fit(X,np.ravel(Y100))
print(model100.feature_importances_)
print("the top attributes are: award, hici and pub")


Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year,top50,top10,top100
4397,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,76.6,2015,1,1,1
4398,2,Stanford University,2,73.3,40.7,89.6,80.1,70.1,70.6,53.8,2015,1,1,1
4399,3,Massachusetts Institute of Technology (MIT),3,70.4,68.2,80.7,60.6,73.1,61.1,68.0,2015,1,1,1
4400,4,"University of California, Berkeley",4,69.6,65.1,79.4,66.1,65.6,67.9,56.5,2015,1,1,1
4401,5,University of Cambridge,1,68.8,77.1,96.6,50.8,55.6,66.4,55.8,2015,1,1,1
4402,6,Princeton University,5,61.0,53.3,93.4,57.1,43.0,42.4,70.3,2015,1,1,1
4403,7,California Institute of Technology,6,59.6,49.5,66.7,49.3,56.4,44.0,100.0,2015,1,1,1
4404,8,Columbia University,7,58.8,63.5,65.9,52.1,51.9,68.8,33.2,2015,1,1,1
4405,9,University of Chicago,8,57.1,59.8,86.3,49.0,42.9,49.8,42.0,2015,1,1,1
4406,10,University of Oxford,2,56.6,49.7,54.9,52.3,51.9,70.9,43.1,2015,1,1,1


array([[1, 'Harvard University', '1', ..., 1, 1, 1],
       [2, 'Stanford University', '2', ..., 1, 1, 1],
       [3, 'Massachusetts Institute of Technology (MIT)', '3', ..., 1, 1,
        1],
       ...,
       [401500, 'Utah State University', '126-146', ..., 0, 0, 0],
       [401500, 'Vienna University of Technology', '4-6', ..., 0, 0, 0],
       [401500, 'Wake Forest University', '126-146', ..., 0, 0, 0]],
      dtype=object)

############# Model Top 10 Results ################


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.204  0.391  0.114  0.116  0.041  0.135]
the top attributes are: award, hici and ns 
############# Model Top 50 Results ################


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.099  0.249  0.213  0.247  0.128  0.064]
the top attributes are: award, hici and pub
############# Model Top 100 Results ################


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.106  0.241  0.218  0.200  0.130  0.105]
the top attributes are: award, hici and pub


In [18]:
############################ AND REPEAT FOR FINANCIAL TIMES
#REUSE THE DATASET WITH TOP10, TOP50 AND TOP100
df.head(12)

#SELECT THE PRINCIPAL COMPONENTS FOR EACH TO BE TESTED AFTER TRANSFORMING THE DATA TO ARRAYS
df_array = df.values

FTX = df_array[:,1:10]
FTY10 = df_array[:,12:13].astype("int")
FTY50 = df_array[:,11:12].astype("int")
FTY100 = df_array[:,13].astype("int")

FTmodel10 = ExtraTreesClassifier(n_estimators=100)
FTmodel50 = ExtraTreesClassifier(n_estimators=100)
FTmodel100 = ExtraTreesClassifier(n_estimators=100)

print("############# Model Top 10 Results ################")
FTmodel10.fit(FTX,np.ravel(FTY10))
print(FTmodel10.feature_importances_)
print("the top attributes are: teaching, research and citations ")

print("############# Model Top 50 Results ################")
FTmodel50.fit(FTX,np.ravel(FTY50))
print(FTmodel50.feature_importances_)
print("the top attributes are: teaching, research and citations")

print("############# Model Top 100 Results ################")
FTmodel100.fit(FTX,np.ravel(FTY100))
print(FTmodel100.feature_importances_)
print("the top attributes are: teaching, research and citations")

Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,female_male_ratio,year,top50,top10,top100
1402,1,92.2,67.0,98.1,99.7,89.1,2243.0,6.9,27.0,33.0,2015,1,1,1
1404,3,88.6,90.7,97.7,95.5,72.9,19919.0,11.6,34.0,46.0,2015,1,1,1
1405,4,91.5,69.0,96.7,99.1,63.1,15596.0,7.8,22.0,42.0,2015,1,1,1
1406,5,89.7,87.8,95.6,95.2,51.1,18812.0,11.8,34.0,46.0,2015,1,1,1
1407,6,89.1,84.3,88.2,100.0,95.7,11074.0,9.0,33.0,37.0,2015,1,1,1
1408,7,86.6,61.2,94.7,99.6,82.7,7929.0,8.4,27.0,45.0,2015,1,1,1
1409,8,84.2,58.5,96.7,99.1,44.8,36186.0,16.4,15.0,50.0,2015,1,1,1
1410,9,84.6,92.7,88.3,89.4,72.7,15060.0,11.7,51.0,37.0,2015,1,1,1
1411,9,88.5,59.8,90.8,94.0,42.0,11751.0,4.4,20.0,50.0,2015,1,1,1
1412,11,83.9,65.2,89.9,97.3,36.8,14221.0,6.9,21.0,42.0,2015,1,0,1


############# Model Top 10 Results ################


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.374  0.037  0.312  0.095  0.026  0.037  0.040  0.039  0.040]
the top attributes are: teaching, research and citations 
############# Model Top 50 Results ################


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.332  0.021  0.405  0.101  0.024  0.019  0.026  0.042  0.028]
the top attributes are: teaching, research and citations
############# Model Top 100 Results ################


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.312  0.030  0.333  0.174  0.040  0.029  0.029  0.030  0.024]
the top attributes are: teaching, research and citations
