In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("notebook")
#sns.set_context("poster")

# Feature Selection

The factor with the biggest impact on the quality of your model is the set of data features. A model can only use the attributes you give it. If those attributes are irrelevant or only partially relevant, fail to capture the causal relationships behind the phenomenon, or introduce relationships driven by causes other than the ones you want to investigate, the model will be poor.

Selecting the relevant features that add to your model is therefore of the utmost importance. 

In this notebook we will deal with four approaches:

        1) Univariate Selection.
        2) Recursive feature elimination.
        3) PCA - Principal Component Analysis.
        4) Estimating feature importance.

Feature selection is the process of selecting the features in your data that contribute most to the variable of interest. Irrelevant features decrease the accuracy of many models because the model tries to fit noise. This is particularly important for linear models, such as linear and logistic regression, where every feature is always taken into account. Feature selection has three main benefits:

        1) Reduces overfitting. Less redundant data means fewer decisions based on noise.
        2) Improves accuracy. Less misleading data results in a more accurate model.
        3) Reduces training time. Less data means faster training.
        
scikit-learn has a short, well-written article on feature selection where you can learn more: https://scikit-learn.org/stable/modules/feature_selection.html

Again we will use the Pima Indians onset of diabetes dataset. 


<img src="Pima_indians_cowboy_1889.jpg">

In this exercise we will use one of the traditional machine learning datasets, the Pima Indians diabetes dataset.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content
The dataset consists of several medical predictor variables and one target variable, <b>Outcome</b>. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
<blockquote>
        <ul style="list-style-type:square;">
            <li>Pregnancies</li> 
            <li>Glucose</li>
            <li>BloodPressure</li>
            <li>SkinThickness</li>
            <li>Insulin</li>
            <li>BMI</li>
            <li>DiabetesPedigreeFunction (scores the likelihood of diabetes based on family history)</li>
            <li>Age</li>
            <li>Outcome</li>
        </ul>
</blockquote>

In [2]:
# Load the Pima indians dataset and separate input and output components 

from numpy import set_printoptions
set_printoptions(precision=3)

filename="pima-indians-diabetes.data.csv"
names=["pregnancies", "glucose", "pressure", "skin", "insulin", "bmi", "pedi", "age", "outcome"]
p_indians=pd.read_csv(filename, names=names)
p_indians.head()

# First we separate into input and output components
array=p_indians.values
X=array[:,0:8]
Y=array[:,8]
# X
# Y
pd.DataFrame(X).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Unnamed: 0,0,1,2,3,4,5,6,7
0,6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,137.0,40.0,35.0,168.0,43.1,2.288,33.0


<h1>Univariate Selection </h1>

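One approach is to use statistical tests. For example, Pearson's chi-squared test ($\chi^2$) is commonly used to select the features most strongly associated with the target. Note that scikit-learn's <b>chi2</b> scoring function requires non-negative feature values.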
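To make the score less of a black box, the sketch below recomputes the chi-squared statistic by hand on a tiny made-up array and compares it with `sklearn.feature_selection.chi2`. It assumes the usual construction: for each feature, the observed value is the sum of that feature within each class, and the expected value is the class frequency times the feature total. The array and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Tiny made-up example: 4 samples, 2 non-negative features, binary target.
X_demo = np.array([[1, 0], [3, 1], [0, 4], [2, 5]], dtype=float)
y_demo = np.array([0, 0, 1, 1])

classes = np.unique(y_demo)
# Observed: sum of each feature within each class, shape (n_classes, n_features).
observed = np.array([X_demo[y_demo == c].sum(axis=0) for c in classes])
# Expected: class frequency times the total of each feature.
class_prob = np.array([(y_demo == c).mean() for c in classes])
expected = np.outer(class_prob, X_demo.sum(axis=0))

# Chi-squared statistic per feature: sum over classes of (O - E)^2 / E.
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

sk_scores, p_values = chi2(X_demo, y_demo)
print(manual_scores)
print(sk_scores)
```

The second feature differs much more between the two classes than the first, so it receives the higher score.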

We will use the <b> SelectKBest </b> class in scikit-learn.
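Reading the selected features off a raw score array is error-prone, so it can help to pair scores with column names. The following is a minimal sketch on synthetic data (the column names `f0` to `f5` are hypothetical), using `get_support()` to recover which columns were kept.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for a real dataset; column names are hypothetical.
X_arr, y_demo = make_classification(n_samples=200, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
# chi2 requires non-negative inputs, so shift each column to start at zero.
X_df = pd.DataFrame(X_arr - X_arr.min(axis=0),
                    columns=[f"f{i}" for i in range(6)])

selector = SelectKBest(score_func=chi2, k=3).fit(X_df, y_demo)

# Pair every score with its column name and sort from best to worst.
scores = pd.Series(selector.scores_, index=X_df.columns).sort_values(ascending=False)
# get_support() returns a boolean mask over the original columns.
selected = list(X_df.columns[selector.get_support()])

print(scores)
print("Selected:", selected)
```

This avoids hard-coding the feature names in a `print` statement, which can silently go stale if the data or `k` changes.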


In [3]:
# Univariate selection using Chi-squared 
set_printoptions(precision=3)
p_indians.head()

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 

# feature selection (we select the 4 best)
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X,Y)
print("Scores")

fit.scores_

print("The 4 attributes with the highest scores are: glucose, insulin, bmi and age ")
print()

features=fit.transform(X)
features[0:5,:]                            #you fit the model and that's it



Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Scores


array([ 111.52 , 1411.887,   17.605,   53.108, 2175.565,  127.669,
          5.393,  181.304])

The 4 attributes with the highest scores are: glucose, insulin, bmi and age 



array([[148. ,   0. ,  33.6,  50. ],
       [ 85. ,   0. ,  26.6,  31. ],
       [183. ,   0. ,  23.3,  32. ],
       [ 89. ,  94. ,  28.1,  21. ],
       [137. , 168. ,  43.1,  33. ]])

<h1>Recursive Feature Elimination</h1>

This is a very intuitive approach. It consists of recursively removing attributes and building a model on the attributes that remain. At each step the least important attribute, as judged by the fitted model's coefficients or feature importances, is dropped, until the requested number of attributes remains.

We will use it with a logistic regression, but the choice of algorithm doesn't matter too much as long as you are consistent.

Recursive Feature Elimination uses the <b>RFE </b> class. 
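As a rough illustration, the sketch below runs RFE on synthetic data and maps `support_` and `ranking_` back to column names (the names `f0` to `f5` are hypothetical). Standardizing the inputs first is an assumption added here, not something this notebook does: logistic regression coefficients depend on feature scale, so scaling can change which features RFE drops.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset; column names are hypothetical.
X_arr, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
cols = [f"f{i}" for i in range(6)]

# Assumption: put all features on a comparable scale before ranking by coefficient.
X_scaled = StandardScaler().fit_transform(X_arr)

# n_features_to_select must be passed by keyword in recent scikit-learn versions.
rfe = RFE(LogisticRegression(solver="liblinear"), n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

# Rank 1 marks a selected feature; larger ranks were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=cols).sort_values()
selected = [c for c, keep in zip(cols, rfe.support_) if keep]

print(ranking)
print("Selected:", selected)
```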


In [4]:
# Recursive Feature Elimination

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

p_indians.head()

#Logistic regression
model = LogisticRegression(solver='liblinear')

rfe = RFE(model, n_features_to_select=3) #  we want to find the 3 top features
fit = rfe.fit(X, Y)

print(f'Number of features {fit.n_features_:d}')
print(f'Selected features {fit.support_}')
print(f'Ranking of features {fit.ranking_}')
print()
print("Top features seem to be pregnancies, bmi, and pedi(Diabetes Pedigree Function)")
features=fit.transform(X)
features
# pd.DataFrame(features).head()

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Number of features 3
Selected features [ True False False False False  True  True False]
Ranking of features [1 2 3 5 6 1 1 4]

Top features seem to be pregnancies, bmi, and pedi(Diabetes Pedigree Function)


array([[ 6.   , 33.6  ,  0.627],
       [ 1.   , 26.6  ,  0.351],
       [ 8.   , 23.3  ,  0.672],
       ...,
       [ 5.   , 26.2  ,  0.245],
       [ 1.   , 30.1  ,  0.349],
       [ 1.   , 30.4  ,  0.315]])

<b><font color="red" size=6>Mission 1</font>

For this mission and the next we will use data from Kaggle, specifically the World University Rankings dataset: https://www.kaggle.com/mylesoneill/world-university-rankings

a) Using the Shanghai rankings, find the 3 features that best explain them with both univariate selection and recursive feature elimination (for the recursive approach, because we use logistic regression, create a binary output variable indicating whether a university is in the top 50).
<br><br>
b) Same for the Times ranking. 
<br><br>
c) Does it change if we choose the top 10 or top 100?
</b>
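The rank columns in these files are not always plain integers, so the binary target needs a small cleaning step first. The sketch below assumes rank strings of the form `'101-150'` (a range) or `'=39'` (with a leading equals sign); the sample values and the helper names `parse_rank` and `top_n_flag` are hypothetical.

```python
import pandas as pd

def parse_rank(rank_series):
    """Convert rank strings such as '12', '=39' or '101-150' to numbers.

    For a range, the lower bound is used. Unparseable values become NaN.
    """
    cleaned = (rank_series.astype(str)
               .str.replace("=", "", regex=False)
               .str.split("-").str.get(0))
    return pd.to_numeric(cleaned, errors="coerce")

def top_n_flag(ranks, n):
    """Return 1 where the rank is within the top n, else 0."""
    return (ranks <= n).astype(int)

raw = pd.Series(["1", "=39", "101-150", "51"])  # illustrative values
ranks = parse_rank(raw)
flags = top_n_flag(ranks, 50)

print(ranks.tolist())
print(flags.tolist())
```

Wrapping the threshold in a function makes part c) a matter of calling `top_n_flag` with 10, 50 and 100 instead of copying the cell.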

In [5]:
# 1.a top3 Shanghai- Univariate and Recursive
sid = pd.read_csv('shanghaiData.csv')
sid = sid[pd.notnull(sid['ns'])]
sid['world_rank'] = sid['world_rank'].str.split('-').str.get(0).astype(int)
sid["top_50"]=sid["world_rank"].apply(lambda x: (1 if x<=50 else 0))
# sid.head()
Xs=sid.iloc[:,4:10]
Xs.head()
Ys = sid['top_50'].values
Yu = sid['world_rank'].values

# sid.info()                                     #I did not take national rank into consideration,
                                               # because it has too many NaN values when I convert it
# sid.isna().sum(axis = 0)                       # to numeric (because it contains dates)
print()
print('-------------------Univariate--------------')
print()

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(Xs,Ys)
fit.scores_
features=fit.transform(Xs)
features[0:5,:]                            
results = sorted(zip(fit.scores_, Xs),reverse=True)[:3]
# print('The selected features are:'+ str([x[1] for x in results])) 
print('The selected features using Univariate are: alumni, award and hici')

print()
print('-------------------Recursive--------------')
print()


from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3) #  we want to find the 3 top features
fit = rfe.fit(Xs, Ys)
print(f'Number of features: {fit.n_features_:d}')
print(f'Selected features: {fit.support_}')
print(f'Ranking of features: {fit.ranking_}')
results = sorted(zip(fit.support_, sid.iloc[:,4:10].columns),reverse=True)[:3]
# print('The selected features are:'+ str([x[1] for x in results])) 
print('The selected features using Recursive are: award, hici and ns')


Unnamed: 0,alumni,award,hici,ns,pub,pcp
0,100.0,100.0,100.0,100.0,100.0,72.4
1,99.8,93.4,53.3,56.6,70.9,66.9
2,41.1,72.2,88.5,70.9,72.3,65.0
3,71.8,76.0,69.4,73.9,72.2,52.7
4,74.0,80.6,66.7,65.8,64.3,53.0



-------------------Univariate--------------



array([36736.822, 73328.165, 33908.102, 27383.534,  7615.474,  6954.132])

array([[100. , 100. , 100. ],
       [ 99.8,  93.4,  53.3],
       [ 41.1,  72.2,  88.5],
       [ 71.8,  76. ,  69.4],
       [ 74. ,  80.6,  66.7]])

The selected features using Univariate are: alumni, award and hici

-------------------Recursive--------------

Number of features: 3
Selected features: [False  True  True  True False False]
Ranking of features: [4 1 1 1 2 3]
The selected features using Recursive are: award, hici and ns


In [6]:
# 1.b. top3 Times- Univariate and Recursive

originaltid = pd.read_csv('timesData.csv')
tid =originaltid.copy()
# tid.head()
# tid.info()
# tid.isna().sum(axis = 0)

tid['world_rank'] = tid['world_rank'].str.split('-').str.get(0)
tid['world_rank'] = tid['world_rank'].str.replace('=', '', regex=False)  # assign back instead of chained inplace replace
print("----------------------------------------------")

tid.income = pd.to_numeric(tid.income, errors='coerce')

tid.international = pd.to_numeric(tid.international, errors='coerce')

tid['international_students'] = tid['international_students'].str.replace('%', '', regex=False)    # \D
tid.international_students = pd.to_numeric(tid.international_students, errors='coerce')

tid.num_students = tid.num_students.astype(str)
tid['num_students'] = tid.num_students.apply(lambda x: x.replace(',', ''))
tid.num_students = pd.to_numeric(tid.num_students, errors='coerce')

# tid.info()
# tid.isna().sum(axis = 0)
tid.drop(['female_male_ratio','country','university_name','total_score','year'],axis=1,inplace=True)
tid.dropna(inplace=True)
# tid.drop(['female_male_ratio','country','university_name','total_score'],axis=1,inplace=True)
# tid.info()
# tid.isna().sum(axis = 0)

# I first tried to use the ratio, but it has too many NaN values, so I dropped it.

tid.world_rank = pd.to_numeric(tid.world_rank, errors='coerce')
tid["top_50"]=tid["world_rank"].apply(lambda x: (1 if x<=50 else 0))
tid.iloc[0:5,:]

# Xt=pd.concat([tid.iloc[:,3:8],tid.iloc[:,9:12]],axis=1).values
Xt= tid.iloc[:,1:9]
Xt.head()
Yt = tid["top_50"].values
Ytu = tid["world_rank"].values

print()
print('-------------------Univariate--------------')
print()
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 
test = SelectKBest(score_func=chi2, k=3)
fit = test.fit(Xt,Yt.astype(int))
fit.scores_
print()
features=fit.transform(Xt)
features                           #you fit the model and that's it
print("The selected features are: teaching, research and num_students")
print()

print('-------------------Recursive--------------')
print()
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3)   #we want to find the 3 top features
fit = rfe.fit(Xt, Yt)

# Xt
# Ytu.shape
# test=pd.Series(Yt)
# test.value_counts()

print(f'Number of features: {fit.n_features_:d}')
print(f'Selected features: {fit.support_}')
print(f'Ranking of features: {fit.ranking_}')
print()
print("The top features are: research, student_staff_ratio and international_students")
# tid.top_50.value_counts()
# tid.info()

----------------------------------------------


Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students,top_50
0,1,99.7,72.4,98.7,98.8,34.5,20152.0,8.9,25.0,1
1,2,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,27.0,1
2,3,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,33.0,1
3,4,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,22.0,1
5,6,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,34.0,1


Unnamed: 0,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students
0,99.7,72.4,98.7,98.8,34.5,20152.0,8.9,25.0
1,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,27.0
2,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,33.0
3,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,22.0
5,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,34.0



-------------------Univariate--------------



array([ 9884.791,   616.044, 14940.548,  3765.02 ,   807.877, 30756.283,
         571.191,  1276.154])




array([[9.970e+01, 9.870e+01, 2.015e+04],
       [9.770e+01, 9.800e+01, 2.243e+03],
       [9.780e+01, 9.140e+01, 1.107e+04],
       ...,
       [2.400e+01, 1.020e+01, 4.122e+03],
       [2.010e+01, 1.600e+01, 1.012e+04],
       [1.620e+01, 1.830e+01, 8.663e+03]])

The selected features are: teaching, research and num_students

-------------------Recursive--------------

Number of features: 3
Selected features: [False False  True False False False  True  True]
Ranking of features: [4 3 1 2 5 6 1 1]

The top features are: research, student_staff_ratio and international_students


In [13]:
# 1.c. top 10 top 100
print()
print('-------------------Shanghai TOP 50 R--------------')
print()

print("The selected features are: award, hici and ns")

print()
print('-------------------Shanghai TOP 10 R--------------')
print()

sid["top_10"]=sid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
Ys = sid['top_10'].values
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3) #  we want to find the 3 top features
fit = rfe.fit(Xs, Ys)
# Xs.head()
print(f'Number of features: {fit.n_features_:d}')
print(f'Selected features: {fit.support_}')
print(f'Ranking of features: {fit.ranking_}')
print("The selected features are: alumni, hici and pub")

print()
print('-------------------Shanghai TOP 100 R--------------')
print()

sid["top_100"]=sid["world_rank"].apply(lambda x: (1 if x<=100 else 0))
Ys = sid['top_100'].values
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3) #  we want to find the 3 top features
fit = rfe.fit(Xs, Ys)
print(f'Number of features: {fit.n_features_:d}')
print(f'Selected features: {fit.support_}')
print(f'Ranking of features: {fit.ranking_}')
print("The selected features are: award, hici and ns")
print("")
print("FOR SHANGHAI WE NOTICE THAT THE TOP 50 AND TOP 100 ARE SIMILAR BUT THE TOP 10 DIFFERS ")

print()
print('-------------------Times TOP 50 R--------------')
print()
print("The top features are: research, student_staff_ratio and international_students")

print()
print('-------------------Times TOP 10 R--------------')
print()

tid["top_10"]=tid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
Yt = tid["top_10"].values
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3) #  we want to find the 3 top features
# Xt.head()
fit = rfe.fit(Xt, Yt)
print(f'Number of features: {fit.n_features_:d}')
print(f'Selected features: {fit.support_}')
print(f'Ranking of features: {fit.ranking_}')
print()
print("The top features are: research, student_staff_ratio and international_students")

print()
print('-------------------Times TOP 100 R--------------')
print()

tid["top_100"]=tid["world_rank"].apply(lambda x: (1 if x<=100 else 0))
Yt = tid["top_100"].values
model = LogisticRegression(solver='liblinear')
rfe = RFE(model, n_features_to_select=3) #  we want to find the 3 top features
fit = rfe.fit(Xt, Yt)
print(f'Number of features: {fit.n_features_:d}')
print(f'Selected features: {fit.support_}')
print(f'Ranking of features: {fit.ranking_}')
print()
print("The top features are: teaching, research and citations ")
print()
print("WE NOTICE CHANGES FOR BOTH THE TIMES AND SHANGHAI RANKINGS!")


-------------------Shanghai TOP 50 R--------------

The selected features are: award, hici and ns

-------------------Shanghai TOP 10 R--------------

Number of features: 3
Selected features: [ True False  True False  True False]
Ranking of features: [1 2 1 4 1 3]
The selected features are: alumni, hici and pub

-------------------Shanghai TOP 100 R--------------

Number of features: 3
Selected features: [False  True  True  True False False]
Ranking of features: [3 1 1 1 2 4]
The selected features are: award, hici and ns

FOR SHANGHAI WE NOTICE THAT THE TOP 50 AND TOP 100 ARE SIMILAR BUT THE TOP 10 DIFFERS 

-------------------Times TOP 50 R--------------

The top features are: research, student_staff_ratio and international_students

-------------------Times TOP 10 R--------------

Number of features: 3
Selected features: [False False  True False False False  True  True]
Ranking of features: [3 5 1 2 4 6 1 1]

The top features are: research, student_staff_ratio and international_students

<h1>Principal Component Analysis</h1>

Principal Component Analysis (PCA) is a data reduction technique based on linear algebra. The idea is to "compress" several dimensions into a smaller number of principal components.

One problem with PCA is explainability. Once you have compressed the attributes into principal components, you can no longer refer to them individually to establish causal links or relationships.

A property of PCA is that you can choose the number of dimensions or principal components. In our example we will select 3 principal components. 

For Principal Component Analysis you use the <b>PCA</b> class. 


In [9]:
from sklearn.decomposition import PCA

p_indians.head()

#PCA
pca = PCA(n_components=3)
pca_fit = pca.fit(X)

print(f"Explained variance: {pca_fit.explained_variance_ratio_}")
print()

np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
print("Principal Components have little resemblance to the source data attributes")
print()
print(pca_fit.components_)

Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Explained variance: [0.889 0.062 0.026]

Principal Components have little resemblance to the source data attributes

[[-0.002  0.098  0.016  0.061  0.993  0.014  0.001 -0.004]
 [-0.023 -0.972 -0.142  0.058  0.095 -0.047 -0.001 -0.140]
 [-0.022  0.143 -0.922 -0.307  0.021 -0.132 -0.001 -0.125]]
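In the output above, the first component explains about 89% of the variance and loads almost entirely on the fifth attribute (insulin). PCA is sensitive to feature scale, so a column with a large variance can dominate the components. A common variant, not used in this notebook, is to standardize the columns first. The sketch below uses synthetic data and assumes `StandardScaler` as the scaling step; it only illustrates the effect and is not a result on the Pima data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: three independent columns on very different scales.
X_demo = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 10, 500),
    rng.normal(0, 100, 500),
])

# Without scaling, the widest column dominates the first component.
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# After standardizing, each column contributes on an equal footing.
X_std = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("Unscaled:", raw_ratio)
print("Scaled:  ", scaled_ratio)
```

Whether to scale depends on the question: if the original units are meaningful and comparable, unscaled PCA can be the right choice.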


<h1>Feature Importance </h1>

One added benefit of tree-based algorithms is that they can estimate the importance of each feature. We can use these estimates to refine the model to different degrees, depending on where we want to sit in the trade-off between explainability and accuracy.

In this example we are going to use the ExtraTreesClassifier, but the technique is common to all tree-based algorithms.

For this example of assessing feature importance with trees we will use the <b>ExtraTreesClassifier</b> class. 
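The importances come back as a bare array in column order, so it is convenient to label and sort them. The sketch below does this on synthetic data with hypothetical column names `f0` to `f5`. Fixing `random_state` is an assumption added here so that repeated runs give the same ranking; without it the values change slightly from run to run.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for a real dataset; column names are hypothetical.
X_arr, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                    n_redundant=0, random_state=0)
cols = [f"f{i}" for i in range(6)]

# random_state makes the randomized trees reproducible.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_arr, y_demo)

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(model.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
top3 = importances.head(3)

print(importances)
print("Top 3:", list(top3.index))
```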


In [10]:
from sklearn.ensemble import ExtraTreesClassifier

p_indians.head()
model = ExtraTreesClassifier(n_estimators=100)
model.fit(X,Y)
print(model.feature_importances_)


Unnamed: 0,pregnancies,glucose,pressure,skin,insulin,bmi,pedi,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.110  0.237  0.097  0.081  0.073  0.140  0.120  0.143]


<b><font color="red" size=6>Mission 2</font>

a) Using the Shanghai data, find the top attributes with a tree classifier for top-10, top-50 and top-100.
<br>
b) Same for the Times ranking. 
<br><br>

</b>
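Since the same fit is repeated for three thresholds, a loop keeps the comparison in one table. This is a minimal sketch on a synthetic frame: the columns `score_a` (built to track the rank) and `score_b` (noise) are hypothetical stand-ins for the real ranking indicators.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 300
# Synthetic ranking table; column names are hypothetical.
demo = pd.DataFrame({
    "world_rank": np.arange(1, n + 1),
    "score_a": np.linspace(100, 0, n) + rng.normal(0, 1, n),  # tracks the rank
    "score_b": rng.normal(50, 10, n),                         # unrelated noise
})
feature_cols = ["score_a", "score_b"]

results = {}
for n_top in (10, 50, 100):
    # Binary target: 1 if the row is within the top n_top, else 0.
    target = (demo["world_rank"] <= n_top).astype(int)
    model = ExtraTreesClassifier(n_estimators=100, random_state=0)
    model.fit(demo[feature_cols], target)
    results[f"top_{n_top}"] = pd.Series(model.feature_importances_,
                                        index=feature_cols)

# One column per threshold, one row per feature.
summary = pd.DataFrame(results)
print(summary)
```

With the real data, `demo` and `feature_cols` would be replaced by the cleaned ranking frame and its indicator columns.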

In [14]:
# Note (translated): prizes can explain the Shanghai top 10, but not Times, which has no prize-based feature.
# 2.a. Shangai Data - top attributes top10/50/100
from sklearn.ensemble import ExtraTreesClassifier
sid.iloc[0:5,:]
print("SHANGHAI 10 ")
sid["top_10"]=sid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
Ys = sid["top_10"]
model = ExtraTreesClassifier(n_estimators=100)
model.fit(Xs,Ys)
print(model.feature_importances_)
print("The top attributes (using trees) top 10 are: alumni, award and pcp")
print()
print("The selected features (using recursive) were: alumni, pub and hici")
print()
print("----------------------------------------------------------------")
print("SHANGHAI 50")
print()
sid["top_50"]=sid["world_rank"].apply(lambda x: (1 if x<=50 else 0))
Ys = sid["top_50"]
model = ExtraTreesClassifier(n_estimators=100)
model.fit(Xs,Ys)
print(model.feature_importances_)
print("The top attributes (using trees) top 50 are: award, hici and ns")
print()
print("The selected features (using recursive) were: award, hici and ns")
print()
print("----------------------------------------------------------------")
print("SHANGHAI 100")
print()
sid["top_100"]=sid["world_rank"].apply(lambda x: (1 if x<=100 else 0))
Ys = sid["top_100"]
model = ExtraTreesClassifier(n_estimators=100)
model.fit(Xs,Ys)
print(model.feature_importances_)
print("The top attributes (using trees) top 100 are: award, hici and ns")
print()
print("The selected features (using recursive) were: award, hici and ns")
# sid.info()


Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year,top_50,top_10,top_100
0,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,72.4,2005,1,1,1
1,2,University of Cambridge,1,73.6,99.8,93.4,53.3,56.6,70.9,66.9,2005,1,1,1
2,3,Stanford University,2,73.4,41.1,72.2,88.5,70.9,72.3,65.0,2005,1,1,1
3,4,"University of California, Berkeley",3,72.8,71.8,76.0,69.4,73.9,72.2,52.7,2005,1,1,1
4,5,Massachusetts Institute of Technology (MIT),4,70.1,74.0,80.6,66.7,65.8,64.3,53.0,2005,1,1,1


SHANGHAI 10 


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.273  0.384  0.077  0.104  0.026  0.136]
The top attributes (using trees) top 10 are: alumni, award and pcp

The selected features (using recursive) were: alumni, pub and hici

----------------------------------------------------------------
SHANGHAI 50



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.101  0.197  0.239  0.247  0.147  0.069]
The top attributes (using trees) top 50 are: award, hici and ns

The selected features (using recursive) were: award, hici and ns

----------------------------------------------------------------
SHANGHAI 100



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.137  0.209  0.230  0.177  0.149  0.098]
The top attributes (using trees) top 100 are: award, hici and ns

The selected features (using recursive) were: award, hici and ns


In [12]:
# 2.b Same for Times

originaltid = pd.read_csv('timesData.csv')
tid =originaltid.copy()
# tid.head()
# tid.info()
# tid.isna().sum(axis = 0)

tid['world_rank'] = tid['world_rank'].str.split('-').str.get(0)
tid['world_rank'] = tid['world_rank'].str.replace('=', '', regex=False)  # assign back instead of chained inplace replace
print("----------------------------------------------")

tid.income = pd.to_numeric(tid.income, errors='coerce')

tid.international = pd.to_numeric(tid.international, errors='coerce')

tid['international_students'] = tid['international_students'].str.replace('%', '', regex=False)    # \D
tid.international_students = pd.to_numeric(tid.international_students, errors='coerce')

tid.num_students = tid.num_students.astype(str)
tid['num_students'] = tid.num_students.apply(lambda x: x.replace(',', ''))
tid.num_students = pd.to_numeric(tid.num_students, errors='coerce')

# tid.info()
# tid.isna().sum(axis = 0)
tid.drop(['female_male_ratio','country','university_name','total_score','year'],axis=1,inplace=True)
tid.dropna(inplace=True)
# tid.drop(['female_male_ratio','country','university_name','total_score'],axis=1,inplace=True)
# tid.info()
# tid.isna().sum(axis = 0)

# I first tried to use the ratio, but it has too many NaN values, so I dropped it.

tid.world_rank = pd.to_numeric(tid.world_rank, errors='coerce')

# Xt=pd.concat([tid.iloc[:,3:8],tid.iloc[:,9:12]],axis=1).values
Xt= tid.iloc[:,1:9]
Xt.head()
Ytu = tid["world_rank"].values
tid.iloc[0:5,:]
tid["top_10"]=tid["world_rank"].apply(lambda x: (1 if x<=10 else 0))
print("TIMES top10")
Yt = tid["top_10"].values

from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier(n_estimators=100)
model.fit(Xt,Yt)
print(model.feature_importances_)
print("The top attributes for FT top 10 are: Teaching, Research, citations")
print()
print("The top features were (using recursive): research, student_staff_ratio and international_students ")
print()
print("----------------------------------------------------------------")
print("TIMES top50")
print()
tid["top_50"]=tid["world_rank"].apply(lambda x: (1 if x<=50 else 0))
Yt = tid["top_50"].values
model = ExtraTreesClassifier(n_estimators=100)
model.fit(Xt,Yt)
print(model.feature_importances_)
print("The top attributes for FT top 50 are: Teaching, Research, Citations")
print()
print("The top features were (using recursive): research, student_staff_ratio and international_students")
print()
print("----------------------------------------------------------------")
print("TIMES top100")
print()
tid["top_100"]=tid["world_rank"].apply(lambda x: (1 if x<=100 else 0))
Yt = tid["top_100"].values
model = ExtraTreesClassifier(n_estimators=100)
model.fit(Xt,Yt)
print(model.feature_importances_)
print("The top attributes for FT top 100 are: Teaching, Research, Citations")
print()
print("The top features were (using recursive): teaching, research and citations")
print()

# Xt


----------------------------------------------


Unnamed: 0,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students
0,99.7,72.4,98.7,98.8,34.5,20152.0,8.9,25.0
1,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,27.0
2,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,33.0
3,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,22.0
5,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,34.0


Unnamed: 0,world_rank,teaching,international,research,citations,income,num_students,student_staff_ratio,international_students
0,1,99.7,72.4,98.7,98.8,34.5,20152.0,8.9,25.0
1,2,97.7,54.6,98.0,99.9,83.7,2243.0,6.9,27.0
2,3,97.8,82.3,91.4,99.9,87.5,11074.0,9.0,33.0
3,4,98.3,29.5,98.1,99.2,64.3,15596.0,7.8,22.0
5,6,90.5,77.7,94.1,94.0,57.0,18812.0,11.8,34.0


TIMES top10


ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.304  0.047  0.324  0.103  0.042  0.054  0.058  0.068]
The top attributes for FT top 10 are: Teaching, Research, citations

The top features were (using recursive): research, student_staff_ratio and international_students 

----------------------------------------------------------------
TIMES top50



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.291  0.036  0.377  0.141  0.039  0.034  0.034  0.049]
The top attributes for FT top 50 are: Teaching, Research, Citations

The top features were (using recursive): research, student_staff_ratio and international_students

----------------------------------------------------------------
TIMES top100



ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
                     max_depth=None, max_features='auto', max_leaf_nodes=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)

[ 0.298  0.040  0.311  0.190  0.045  0.038  0.035  0.043]
The top attributes for FT top 100 are: Teaching, Research, Citations

The top features were (using recursive): teaching, research and citations

