# Implement Jaccard score for 5/20 news groups 

# Project scope

## Requirements for the result
### Build a comparison table and analyse:
1. The most efficient number of words per class profile (20, 50, 100)
2. The best classifier
3. The best combination of both

## [Imports](#imports)
 **Import the necessary libraries and objects from sklearn**
 
### Dataset
- fetch_20newsgroups (select 5 from 20)

### Metrics
- confusion_matrix
- classification_report
- accuracy_score
- jaccard_similarity_score

### Classifier
- SGDClassifier
- LogisticRegression
- KNeighborsClassifier

### Vectorizer
- CountVectorizer
- TfidfTransformer

## General algorithm 

### [Prepare raw sets](#prep)
1. Unpack the groups of the selected categories separately
2. Remove: headers, footers, quotes
3. Split into train and test sets

### [Vectorize sets](#vect)
**for each set**
- remove stop words
- calculate count score (fit, transform)
- calculate TF-IDF score (fit, transform)

### [Build Jaccard profiles](#bjp)
After basic vectorization and data cleaning, calculate the Jaccard score for each word.
Based on the requirements, build a profile for each class.

### [Train classifier](#train)
1. Combine all profiles into a single vector
2. Train a classifier on each calculated class profile (5 classes × 3 profile sizes: 20, 50, 100 words)

### [Test prediction](#pred)
Run the trained classifier's predict method on the test set to get the predicted labels

### [Calculate accuracy](#report)
Using the imported metrics, calculate the accuracy of each classifier / class-profile combination.
Store the calculated results in a dict:

```
{
    'class_profile_name' : {
        'volume_of_profile': number_of_words,
        'accuracy': accuracy_of_prediction,
    }
}
```
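
As a rough sketch of how such a dict could be filled and compared, the snippet below computes accuracy by hand on made-up labels and picks the best entry. The helper names `accuracy` and `best_profile`, the entry names such as `sgd_20_words`, and the toy labels are all assumptions for illustration; they are not results or code from this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def best_profile(results):
    """Return the name of the entry with the highest accuracy."""
    return max(results, key=lambda name: results[name]['accuracy'])


# Toy labels, invented purely for illustration (not results from this notebook).
y_true = [0, 1, 2, 2, 1, 0, 3, 4]
toy_predictions = {
    'sgd_20_words': (20, [0, 1, 2, 1, 1, 0, 3, 0]),
    'sgd_50_words': (50, [0, 1, 2, 2, 1, 0, 3, 0]),
}

# Fill the results dict in the structure described above.
results = {}
for name, (number_of_words, y_pred) in toy_predictions.items():
    results[name] = {
        'volume_of_profile': number_of_words,
        'accuracy': accuracy(y_true, y_pred),
    }

# Print a small comparison table: name, profile size, accuracy.
for name, row in results.items():
    print(f"{name:<15} {row['volume_of_profile']:>5} {row['accuracy']:.3f}")
print('best:', best_profile(results))
```

In the real notebook the hand-written `accuracy` would presumably be replaced by `accuracy_score` from sklearn, with one entry per classifier and profile size.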

## Import libs and dataset<a id="imports"></a>

In [200]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, jaccard_similarity_score
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline


## Import 20 news groups dataset and select 5 groups <a id="prep"></a>

In [201]:
datasets = []
categories = ['sci.space', 'comp.graphics', 'talk.politics.misc', 'rec.sport.hockey', 'comp.sys.mac.hardware'] 
remove = ('headers', 'footers', 'quotes')
# twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories = categories, remove = remove )
# twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, categories = categories, remove = remove )

for category in categories:
    train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories = [category], remove = remove )
    test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, categories = [category], remove = remove )
    datasets.append((train, test))
print(len(datasets))

5


## Vectorization <a id="vect"></a>

In [202]:
vectores = []
for train, test in datasets:
    vect = CountVectorizer(max_features=100,stop_words = 'english')
    vect.fit(train.data)
    train_data = vect.transform(train.data)
    test_data = vect.transform(test.data)
    vectores.append((train_data, test_data))



In [203]:

tfidf_vectores = []
for target, data in enumerate(vectores):
    tfidf = TfidfTransformer(use_idf = True).fit(data[0])
    train_data_tfidf = tfidf.transform(data[0])
    tfidf_vectores.append((train_data_tfidf, str(target)))

In [204]:
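# NOTE: `vect` is the vectorizer left over from the last iteration of the loop in In [202],
# so the feature names printed here come from the last category's vocabulary only.
def SortbyTF(inputStr):
    # Sort key: the summed TF-IDF weight of a (word, weight) pair
    return inputStr[1]

for id, train_data_tfidf in enumerate(tfidf_vectores):
    # train_data_tfidf is a (tfidf_matrix, label) tuple; sum each column over all documents
    x = list(zip(vect.get_feature_names(), np.ravel(train_data_tfidf[0].sum(axis=0))))
    x.sort(key=SortbyTF, reverse = True)
    print(id)
    print (x[:10])
    print(len(x))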


0
[('simms', 63.638236720259407), ('info', 40.217516084378431), ('got', 34.489623755977327), ('meg', 28.974849998822815), ('think', 28.337530277047314), ('computer', 27.555136319539837), ('hard', 25.862687158568772), ('time', 24.214750077698607), ('nubus', 23.623757516146643), ('need', 22.128767174173799)]
100
1
[('hard', 37.584925681349404), ('support', 32.852919317138479), ('machine', 30.786586539794023), ('cpu', 28.24911095519138), ('info', 25.856203116615056), ('macs', 25.612822548625736), ('time', 24.688387491054481), ('drive', 24.669020709429397), ('drives', 24.240591856766716), ('plus', 23.668381802928767)]
100
2
[('modem', 42.89142619285699), ('cd', 28.872638336374333), ('disk', 27.732359486938975), ('external', 26.007861078641259), ('support', 23.939126792300943), ('good', 23.59494928308845), ('hardware', 20.737195960640889), ('cache', 20.348902965458109), ('mb', 19.729841803797793), ('just', 19.50207281270664)]
100
3
[('mb', 52.349835237372403), ('used', 48.657122613379769), 

In [1]:
# for id, train_data_tfidf in enumerate(tfidf_vectores): 
#     print(train_data_tfidf[0])

## Build Jaccard profiles <a id="bjp"></a>

In [206]:
"""Here will be my function"""

'Here will be my function'
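
The function is not written yet. One possible sketch, under my own assumption about what "Jaccard score for each word" means: for a word and a class, take the set of documents containing the word and the set of documents in the class, and compute |intersection| / |union|. The names `jaccard` and `build_profiles`, the whitespace tokenization, and the toy corpus are all hypothetical; the real version would work on the vectorized 20 newsgroups data.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_profiles(docs, labels, top_n):
    """Return {label: top_n words with the highest word/class Jaccard score}."""
    # Map each word to the set of document indices that contain it.
    docs_with_word = {}
    for idx, text in enumerate(docs):
        for word in set(text.lower().split()):
            docs_with_word.setdefault(word, set()).add(idx)

    profiles = {}
    for label in sorted(set(labels)):
        # Indices of the documents that belong to this class.
        class_docs = {i for i, l in enumerate(labels) if l == label}
        scored = [(jaccard(ids, class_docs), word)
                  for word, ids in docs_with_word.items()]
        # Highest score first; ties broken alphabetically for a stable result.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        profiles[label] = [word for score, word in scored[:top_n]]
    return profiles


# Toy corpus, invented for illustration only.
docs = [
    "rocket orbit launch",
    "orbit moon rocket",
    "puck goal ice",
    "goal ice team",
]
labels = ['space', 'space', 'hockey', 'hockey']
profiles = build_profiles(docs, labels, top_n=2)
print(profiles)
```

Calling `build_profiles` with `top_n` set to 20, 50 and 100 would give the three profile sizes named in the requirements.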

## Train Classifier <a id="train"></a> ...work in progress...

In [207]:
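from scipy.sparse import vstack

# Stack the per-category TF-IDF matrices into one matrix and build one label per document.
# NOTE: each category used its own 100-word vocabulary, so the columns are not aligned
# across categories yet; a shared vocabulary is still needed for meaningful training.
final_train_data = vstack([data for data, t in tfidf_vectores])
target = np.concatenate([np.full(data.shape[0], int(t)) for data, t in tfidf_vectores])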

In [209]:
clf = SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)

In [211]:
clf.fit(final_train_data, target)

## Make Predictions <a id="pred"></a>

In [None]:
prediction = clf.predict(test_data,)

## Reporting <a id="report"></a>

In [None]:
np.mean(prediction == twenty_test.target)

# Training without Jaccard

In [188]:
# categories = ['sci.space',] 
categories = ['sci.space', 'comp.graphics', 'talk.politics.misc', 'rec.sport.hockey', 'comp.sys.mac.hardware'] 
remove = ('headers', 'footers', 'quotes')
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories = categories, remove = remove )
twenty_test = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, categories = categories, remove = remove )


In [189]:
vect = CountVectorizer(stop_words = 'english')

In [190]:
vect.fit(twenty_train.data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [191]:


train_data = vect.transform(twenty_train.data)
test_data = vect.transform(twenty_test.data)


In [192]:
tfidf = TfidfTransformer(use_idf = True).fit(train_data)
train_data_tfidf = tfidf.transform(train_data)


In [193]:
x = list(zip(vect.get_feature_names(), np.ravel(train_data.sum(axis=0))))
def SortbyTF(inputStr):
    return inputStr[1]
x.sort(key=SortbyTF, reverse = True)
print (x[:10])
print(len(x))


[('space', 1086), ('don', 962), ('like', 953), ('just', 896), ('know', 871), ('think', 849), ('people', 833), ('time', 771), ('new', 750), ('10', 692)]
31002


In [194]:
clf = SGDClassifier(loss='hinge', penalty='l2',alpha=1e-3, random_state=42,max_iter=5, tol=None)

In [195]:
clf.fit(train_data_tfidf, twenty_train.target)

SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False)

In [196]:
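# NOTE: test_data holds raw counts, while clf was trained on TF-IDF features;
# tfidf.transform(test_data) would match the training representation.
prediction = clf.predict(test_data)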

In [197]:
prediction[:10]

array([3, 1, 3, 4, 3, 2, 1, 2, 2, 2])

In [198]:
print (twenty_test.target[:10])

[3 0 3 4 3 2 1 2 2 2]


In [199]:
np.mean(prediction == twenty_test.target)

0.85935002663825255
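
`Pipeline` is imported at the top but never used. A minimal sketch of how the same three steps could be chained with it, so that test texts always pass through the same count and TF-IDF transforms as the training texts, might look like this. The toy corpus and the step names `vect`, `tfidf` and `clf` are my own choices for illustration; the notebook would pass `twenty_train.data` and `twenty_train.target` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only.
train_texts = [
    "rocket orbit launch moon",
    "satellite orbit rocket",
    "hockey puck goal ice",
    "ice team goal hockey",
]
train_labels = [0, 0, 1, 1]
test_texts = ["rocket launch", "hockey goal"]

# Chain vectorizer, TF-IDF weighting and classifier into one estimator.
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])

# fit() runs fit_transform on each transformer, then fits the classifier;
# predict() applies the fitted transforms to raw text before classifying.
text_clf.fit(train_texts, train_labels)
predicted = text_clf.predict(test_texts)
print(predicted)
```

Since the pipeline takes raw text on both `fit` and `predict`, there is no separate step where the TF-IDF transform of the test set can be forgotten.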