## Feature selection

In the following exercises, the 20 newsgroups data set (loaded with `fetch_20newsgroups`) will be analyzed. Features representing the documents will be evaluated from the point of view of a classification task using different feature evaluation criteria.

Choose (uncomment) several (3-5) groups from the data set. These groups will be regarded as different classes.

In [2]:
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = [
    'alt.atheism',
    'comp.graphics',
    'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware',
    'comp.windows.x',
    'misc.forsale',
    #'rec.autos',
    #'rec.motorcycles',
    #'rec.sport.baseball',
    #'rec.sport.hockey',
    #'sci.crypt',
    #'sci.electronics',
    #'sci.med',
    'sci.space',
    #'soc.religion.christian',
    #'talk.politics.guns',
    #'talk.politics.mideast',
    #'talk.politics.misc',
    #'talk.religion.misc'
]
dataset = fetch_20newsgroups(subset='all', categories = categories, shuffle=True, random_state=42)

Converting documents to feature vectors.

In [None]:
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, max_features = None, stop_words='english')
docs_vectors = vectorizer.fit_transform(dataset.data)
print("Number of documents: %d\nNumber of features: %d" %docs_vectors.shape)

Splitting the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(docs_vectors, dataset.target, test_size=0.33, random_state=0)

Classifier training.

In [None]:
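# Import the class directly so that the name `tree` refers only to the classifier
# (importing the `tree` module and then reassigning it breaks the cell on re-run)
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(data_train, target_train)
print("Classification error for the test set is: ", 1.0 - tree.score(data_test, target_test))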

To perform feature selection, the `SelectKBest` class may be used. It evaluates features using a scoring function passed as an argument.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest

The `fit` method calculates the measure value for each feature. These values may be read from attribute `scores_`.

The `transform` method reduces the data set to the given number of features with the highest scores.

The `fit_transform` method both scores the features and reduces the data set. The same effect is obtained by calling `fit` followed by `transform`.

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
select_mutual = SelectKBest(mutual_info_classif, k = 100)
select_mutual.fit(data_train, target_train)
select_mutual.scores_
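The relation between `fit`, `transform` and `fit_transform` can be checked on a small example. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` and swaps in the deterministic `f_classif` score instead of `mutual_info_classif`, so that two separate fits give identical scores. The names `X_demo`, `y_demo` and `selector` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 200 samples, 20 features
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, random_state=0)

# Two steps: score the features, then keep the 5 best
selector = SelectKBest(f_classif, k=5)
X_two_steps = selector.fit(X_demo, y_demo).transform(X_demo)

# One step: the same operation through fit_transform
X_one_step = SelectKBest(f_classif, k=5).fit_transform(X_demo, y_demo)

print(X_two_steps.shape)
print(np.array_equal(X_two_steps, X_one_step))

# The kept columns are exactly those with the 5 highest values in scores_
top_idx = np.argsort(selector.scores_)[::-1][:5]
print(sorted(top_idx) == sorted(selector.get_support(indices=True)))
```

`get_support(indices=True)` returns the column indices that survive the selection, which is useful when the selected features have to be mapped back to their names.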

**Exercise 1 (2.5 pt)**

Modify the data set by keeping only the N features that achieve the highest chi2 scores. Train and test the decision tree after the modification. Does feature reduction change the classification error? Find the optimal value of N, i.e. the value minimizing the classification error. Remember that test data should never take part in adjusting the number of features. This means you need to split the data into training and test subsets first. Adjust N on the training data in a cross-validation procedure. Then train the final tree for the identified optimal N on the training subset and test it on the test subset.

**Exercise 2 (1 pt)**

Display the 30 words corresponding to the features with the highest mutual information scores. Can you see the relation between the words and the classes?

## Feature extraction

Principal component analysis (PCA) is a method commonly used to extract uncorrelated features. It is implemented in the `PCA` class. The `n_components` parameter defines the number of new features to extract.

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
digits = load_digits(n_class=10)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
pca = PCA(n_components = 2)
X_new = pca.fit_transform(X)
plt.scatter(X_new[:,0], X_new[:,1],c=y)
plt.show()
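The claim that PCA extracts uncorrelated features can be verified numerically. The sketch below (variable names such as `pca_demo` and `X_pca` are illustrative) computes the covariance matrix of the transformed `digits` data and checks that its off-diagonal entries vanish up to rounding error.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data

# Project the 64 pixel features onto the first 5 principal components
pca_demo = PCA(n_components=5)
X_pca = pca_demo.fit_transform(X_digits)

# Covariance matrix of the new features (columns are variables)
cov = np.cov(X_pca, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))

# Off-diagonal covariances are numerically zero: the new features are uncorrelated
print(np.abs(off_diag).max())

# The diagonal holds the variances of the components, in decreasing order
print(np.diag(cov))
```

The diagonal values coincide with the `explained_variance_` attribute of the fitted object.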

Linear discriminant analysis (LDA) is another feature extraction method. In contrast to PCA, it takes into account information about classes.

http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
plt.scatter(X_lda[:,0], X_lda[:,1], c=y)
plt.show()
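Because LDA uses class information, the number of features it can extract is bounded by the number of classes. As a minimal sketch (names `X_d`, `y_d`, `lda_full` are illustrative), leaving `n_components` at its default on the 10-class `digits` data yields at most `n_classes - 1` new features.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_d, y_d = load_digits(return_X_y=True)

# With n_components left at None, LDA keeps min(n_classes - 1, n_features) components
lda_full = LinearDiscriminantAnalysis()
X_full = lda_full.fit_transform(X_d, y_d)

# 10 digit classes give at most 9 discriminant directions
print(X_full.shape)
```

This differs from PCA, where the limit depends only on the number of samples and original features.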

**Exercise 3 (1 pt)**

For `digits` data set, perform PCA and plot a graph showing the variance values for successive principal components. The graph should be drawn for all components. The variance values can be read from the `pca.explained_variance_ratio_` attribute. 

**Exercise 4 (1.5 pt)**

Write a function that returns the number of features that need to be included in order to retain a given amount of variance (the sum of the variances associated with the components) after performing PCA. The amount of variance should be given as a parameter of the function. This parameter may take values from the range (0, 1]. If the input parameter is 1, which means 100% of the variance, the function should return the maximum possible number of features.

# Faces

The Olivetti faces data set (loaded with `fetch_olivetti_faces`) contains 400 images of size 64x64 showing 40 people.

In [None]:
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data
n_samples, n_features = faces.shape

In [None]:
image_shape = (64, 64)
plt.figure(figsize = (3 * 4, 3* 100))
for i, image in enumerate(faces[:20,:]):
    plt.subplot(100, 4, i + 1)
    plt.imshow(image.reshape(image_shape), cmap=plt.cm.gray)
plt.show()

Below PCA is performed for the given data set. The obtained components (eigenvectors of the covariance matrix of the data) are then displayed as images.

In [None]:
faces_centered = faces - faces.mean(axis=0)
pca = PCA(n_components=100)
pca.fit(faces_centered)
image_shape = (64, 64)
plt.figure(figsize = (3 * 4, 3* 100))
for i, image in enumerate(pca.components_[:20,:]):
    plt.subplot(100, 4, i + 1)
    plt.imshow(image.reshape(image_shape), cmap=plt.cm.gray)
plt.show()
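The statement that the components are eigenvectors of the covariance matrix can be checked directly. The sketch below uses small synthetic data (an assumption, chosen so that the eigenvalues are clearly distinct) and compares `components_` with the output of `np.linalg.eigh`; eigenvectors are only defined up to sign, so absolute cosines are compared.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with clearly different variances along each axis
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(500, 4)) * np.array([5.0, 3.0, 1.5, 0.5])

pca_syn = PCA(n_components=4).fit(X_syn)

# Eigen-decomposition of the sample covariance matrix
cov = np.cov(X_syn, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder to descending, as PCA does
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows of components_ match the eigenvectors (columns of eigvecs) up to sign
cosines = np.abs(np.sum(pca_syn.components_ * eigvecs.T, axis=1))
print(np.allclose(cosines, 1.0))

# The eigenvalues are the variances explained by each component
print(np.allclose(pca_syn.explained_variance_, eigvals))
```

For the face images, each row of `components_` has 4096 entries, which is why it can be reshaped to 64x64 and displayed as an image.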

**Exercise 5 (1.5 pt)**

Reduce the number of features describing face images using PCA. Use the function implemented in Exercise 4 to decide on the final number of new features. Perform training and testing of a selected classifier recognizing people in photos.

Note 1: the number of features after the transformation will be equal to the value of `n_components`, or to the number of images in the set if there are fewer images than `n_components`.

Note 2: PCA should be performed on the basis of training data, while the test data should be transformed using the transformation matrix obtained by running PCA on the training data. Explain why it is not appropriate to use PCA on the complete data set (training + testing).
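The mechanics of both notes can be sketched on the `digits` data, which is used here only because it ships with scikit-learn (the variable names and the choice of 20 components are assumptions for illustration). PCA is fitted on the training subset alone, and the same fitted object then transforms the test subset.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.3, random_state=0)

# The projection is learned from the training data only
pca_tr = PCA(n_components=20).fit(X_tr)
X_tr_red = pca_tr.transform(X_tr)
X_te_red = pca_tr.transform(X_te)   # same projection applied, no refitting

clf = DecisionTreeClassifier(random_state=0).fit(X_tr_red, y_tr)
print(X_tr_red.shape[1], X_te_red.shape[1])
print(clf.score(X_te_red, y_te))

# Note 1: with only 10 samples of 64 features, at most 10 components are available
n_small = PCA().fit(X_tr[:10]).n_components_
print(n_small)
```

Calling `fit` or `fit_transform` on the test subset would learn a different projection, so the two subsets would no longer be described by the same features.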


**Exercise 6 (1 pt)**

Perform the same experiment as in Exercise 5, but using one of the feature selection methods instead of PCA. Compare the results, paying attention to the number of features that need to be selected to reach the same level of performance as in Exercise 5.

# Parameter tuning

The `GridSearchCV` class trains and evaluates a model for every combination of the given parameter values and selects the combination that is best according to a given criterion (for classifiers, typically the minimum classification error, i.e. the maximum accuracy). Each combination is evaluated in a cross-validation procedure.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In the example below, a decision tree for different attribute selection criteria and different values for the minimum number of examples in leaves will be trained for `digits` data.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix #, accuracy_score
data_train, data_test, target_train, target_test = train_test_split(digits.data, digits.target, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier()
parameters ={'criterion': ['gini', 'entropy'], 'min_samples_leaf': [5,4,3]}
search = GridSearchCV(tree, parameters, cv=5)
search.fit(data_train, target_train)

The parameter values with the best score are saved in the `best_params_` attribute. 

In [None]:
search.best_params_

Detailed results for each parameter set can be read from the attribute `cv_results_`.

In [None]:
search.cv_results_
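The raw `cv_results_` dictionary is hard to read, so a compact summary can help. The sketch below repeats a grid search of the same shape on the full `digits` data (the names `search_demo` and `grid` are illustrative, and `random_state=0` is added only to make the run repeatable) and prints one line per parameter combination.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_g, y_g = load_digits(return_X_y=True)
grid = {'criterion': ['gini', 'entropy'], 'min_samples_leaf': [5, 4, 3]}
search_demo = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search_demo.fit(X_g, y_g)

# cv_results_ is a dict of arrays with one entry per parameter combination
results = search_demo.cv_results_
for params, mean, std, rank in zip(results['params'],
                                   results['mean_test_score'],
                                   results['std_test_score'],
                                   results['rank_test_score']):
    print(f"rank {rank}: {params} -> {mean:.3f} +/- {std:.3f}")
```

The entry with rank 1 corresponds to `best_params_`, and its position in the arrays is stored in `best_index_`.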

Testing the model for the best parameter settings.

In [None]:
search.score(data_test, target_test)

# Pipeline processing

If our process consists of several stages, the `Pipeline` class can be used.

http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

The following example defines a process consisting of a feature extractor (PCA) and a classifier (decision tree).

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([('extract', PCA()), ('classify', DecisionTreeClassifier())])

Pipelining is usually combined with the search for the optimum set of parameters. The object representing the multi-stage process is passed to `GridSearchCV` as the estimator. In the example below, the optimum number of PCA components is searched for in the defined pipeline. Parameter names are created by joining the name of the stage (`extract`) and the name of the parameter (`n_components`) with a double underscore.

In [None]:
parameters = {'extract__n_components': [10,20,30]}
search = GridSearchCV(pipe, parameters, cv=5)
search.fit(data_train, target_train)
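When it is unclear which parameter names a pipeline accepts, they can be listed with `get_params`. The following sketch (the name `pipe_demo` is illustrative) prints all stage-level parameter names and sets two of them with `set_params`, using the same double-underscore convention.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe_demo = Pipeline([('extract', PCA()), ('classify', DecisionTreeClassifier())])

# Names of the form <stage>__<parameter> are the valid keys for a parameter grid
names = sorted(p for p in pipe_demo.get_params() if '__' in p)
print(names)

# The same naming scheme works for setting parameters directly
pipe_demo.set_params(extract__n_components=10, classify__min_samples_leaf=3)
print(pipe_demo.named_steps['extract'].n_components)
```

Any of the listed names can be used as a key in the grid passed to `GridSearchCV`, so parameters of several stages can be tuned in one search.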

In [None]:
search.best_params_

In [None]:
search.score(data_test, target_test)

**Exercise 7 (1.5 pt)**

For the selected data set, design a process consisting of feature selection, feature extraction and classifier training. Optimum selection and extraction parameters should be adjusted using the `GridSearchCV` class. You can choose any classifier.