### Zadanie 1
W `lab4.ipynb` jest przykładowa próba stworzenia klasyfikatora dla problemu klasyfikacji wiadomości do jednej z 20 grup tematycznych (zbiór: `20newsgroups`). Postaraj się uzyskać możliwie wysoką dokładność testową poprawiając **pipeline**, testując różne modele, stosując **GridSearch**, rozważając użycie **n-gramów** (ngram_range), itp. Za zadanie możesz otrzymać $x − 80$ punktów, gdzie $x$ to Twoja średnia dokładność testowa (w procentach) uzyskana z walidacji krzyżowej (k-Fold, OOB).

In [1]:
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train', shuffle=True)
train.target_names #prints all the categories

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [2]:
print('Train set size: %s ' % train.target.size)

Train set size: 11314 


In [3]:
print('FIRST TEXT CATEGORY: %s \n\n' % train.target_names[train.target[0]])
print('FIRST TEXT: \n')
print('\n'.join(train.data[0].split("\n")))

FIRST TEXT CATEGORY: rec.autos 


FIRST TEXT: 

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [17]:
import numpy as np
np.set_printoptions(precision=2)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Using DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# dtc = DecisionTreeClassifier().fit(X_train_tfidf, train.target)

from sklearn.model_selection import cross_val_score

In [5]:
# We can write less code and do all of the above, by building a pipeline.
# The names ‘vect’ , ‘tfidf’ and ‘clf’ are arbitrary.
# The purpose of the pipeline is to assemble several steps that can be
# cross-validated together while setting different parameters.

from sklearn.pipeline import Pipeline

pipe_clf = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()), 
    ('dtc', DecisionTreeClassifier())
])

# Now we can use orginal dataset train.data
pipe_clf = pipe_clf.fit(train.data, train.target)

In [6]:
# Performance of DecisionTreeClassifier
test = fetch_20newsgroups(subset='test', shuffle=True)
predicted = pipe_clf.predict(test.data)
accurracy = np.mean(predicted == test.target)

# is the result realy bad?
print('Accuracy: %s' % accurracy)

Accuracy: 0.5531067445565587


In [7]:
# Create a list of parameters and their values to be checked.
# All the parameters name are of the form 'stepName__paramName'.
# E.g. 'vect__ngram_range': [(1, 1), (1, 2)]
# that means use unigram and bigrams and choose the one which is optimal.

parameters = {
    'vect__ngram_range': [(1, 2)],  
    'tfidf__use_idf': (True, False),
#     'dtc__max_depth': (20,40)
}

In [8]:
#BELOW COMMANDS ARE TIME EXPENSIVE!

# n_jobs=-1 means using all cores
# Perheps you may need to run "conda install -c anaconda joblib" 

from sklearn.model_selection import GridSearchCV

gs_clf = GridSearchCV(pipe_clf, parameters, n_jobs=-1)

# Run the grid search on the pipeline
gs_clf = gs_clf.fit(train.data, train.target)
print("Best score: %s" % gs_clf.best_score_) 
print("Best param: %s" % gs_clf.best_params_) 

Best score: 0.6393840402617277
Best param: {'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}


In [11]:
from sklearn.svm import LinearSVC

pipeline_lsvc = Pipeline([
    ('vect', CountVectorizer()), 
    ('tfidf', TfidfTransformer()),
    ('lsvc', LinearSVC())
])

pipe_lsvc = pipeline_lsvc.fit(train.data, train.target)

parameters_lsvc = {
    'vect__ngram_range': [(1, 2)],  
    'tfidf__use_idf': (True, False),
    'lsvc__C': (0.1, 1, 10)
}

gs_lsvc = GridSearchCV(pipe_lsvc, parameters_lsvc, n_jobs=-1)

# Run the grid search on the pipeline
gs_lsvc = gs_lsvc.fit(train.data, train.target)
print("Best score: %s" % gs_lsvc.best_score_) 
print("Best param: %s" % gs_lsvc.best_params_)

Best score: 0.9307937672619891
Best param: {'lsvc__C': 10, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}


In [19]:
pipeline_lsvc_best = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))), 
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('lsvc', LinearSVC(C=10))
])

pipe_lsvc_best = pipeline_lsvc.fit(train.data, train.target)

# Perform k-Fold cross-validation
scores = cross_val_score(pipe_lsvc_best, test.data, test.target, cv=5)
# Calculate and print the mean accuracy
mean_accuracy = np.mean(scores)
print("Mean cross-validated accuracy: %s" % mean_accuracy)

Mean cross-validated accuracy: 0.9113127670693031


#### Dołożony stemming i usunięcie stopwords

In [14]:
import nltk
nltk.download('snowball_data')
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package snowball_data to
[nltk_data]     /home/dominik/nltk_data...
[nltk_data]   Package snowball_data is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dominik/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english", ignore_stopwords=True)
print('running --> %s' % stemmer.stem("running"))
print('generously --> %s' % stemmer.stem("generously"))

running --> run
generously --> generous


In [16]:
# Use stemming in the vectorization process

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])
    

pipe_stemmed = Pipeline([
    ('vect', StemmedCountVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('lsvc', LinearSVC(C=10))
])

pipe_stemmed = pipe_stemmed.fit(train.data, train.target)

predicted_stemmed = pipe_stemmed.predict(test.data)
print('Accuracy after stemming: %s' % np.mean(predicted_stemmed == test.target))

# Perform k-Fold cross-validation on TRAIN set
scores = cross_val_score(pipe_stemmed, train.data, train.target, cv=5)
# Calculate and print the mean accuracy
mean_accuracy = np.mean(scores)
print("Mean cross-validated accuracy (Train): %s" % mean_accuracy)

Accuracy after stemming: 0.8604620286776421
Mean cross-validated accuracy: 0.9292030758134648


In [20]:
# Perform k-Fold cross-validation on TEST set
scores = cross_val_score(pipe_stemmed, test.data, test.target, cv=5)
# Calculate and print the mean accuracy
mean_accuracy = np.mean(scores)
print("Mean cross-validated accuracy (Test): %s" % mean_accuracy)

Mean cross-validated accuracy (Test): 0.9167561560878802
