# Machine Learning Project

Pia CHANCEREL - Raphael LASRY - Maxime POLI

Based on the article:

**A Continuation Method for Semi-Supervised SVMs**

Olivier Chapelle, olivier.chapelle@tuebingen.mpg.de

Mingmin Chi, mingmin.chi@tuebingen.mpg.de

Alexander Zien, alexander.zien@tuebingen.mpg.de


Max Planck Institute for Biological Cybernetics, Tübingen, Germany

https://dl.acm.org/doi/pdf/10.1145/1143844.1143868?download=true

In [3]:
import numpy as np

# Data

20newsgroup dataset: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

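The dataset is an archive of messages posted to old discussion newsgroups. We focus only on the messages from two groups, `comp.windows.x` and `comp.sys.mac.hardware` (referred to below as windows and mac). The goal is to predict which of the two groups a message belongs to, using a semi-supervised SVM ($S^3VM$).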

In [25]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

print(newsgroups_train.DESCR)  # Dataset documentation; newsgroups_train is created in the loading cell (In [18]), which must be run first

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [18]:
cat = ['comp.sys.mac.hardware', 'comp.windows.x']
newsgroups_train = fetch_20newsgroups(subset = 'train', categories = cat) 
newsgroups_test = fetch_20newsgroups(subset = 'test', categories = cat) 

In [24]:
print(newsgroups_train.filenames.shape)  # One file path per message; the message text itself is in newsgroups_train.data
print(newsgroups_train.target.shape)  # Integer category label of each message (index into newsgroups_train.target_names)

(1171,)
(1171,)
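
In a semi-supervised setting, only a few messages keep their label and the others are treated as unlabeled. The sketch below shows one way such a split could be made, on a toy target vector standing in for `newsgroups_train.target`. The helper name `split_labeled_unlabeled`, the number of labeled points and the seed are hypothetical choices for illustration and are not taken from the article.

```python
import numpy as np

def split_labeled_unlabeled(y, n_labeled, seed=0):
    """Hypothetical helper: draw n_labeled indices at random, the rest are unlabeled."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    labeled_idx = np.sort(perm[:n_labeled])
    unlabeled_idx = np.sort(perm[n_labeled:])
    # SVM formulations usually work with labels in {-1, +1} rather than {0, 1}
    y_signed = 2 * np.asarray(y) - 1
    return labeled_idx, unlabeled_idx, y_signed

# Toy target vector standing in for newsgroups_train.target (assumption)
y_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
labeled_idx, unlabeled_idx, y_signed = split_labeled_unlabeled(y_toy, n_labeled=3)
print(labeled_idx, unlabeled_idx, y_signed[labeled_idx])
```

A purely random draw does not guarantee that both classes appear among the labeled points, so a real experiment would probably need to check for that.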


In [26]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)  # Convert each text message to a TF-IDF feature vector
print(vectors.shape)

(1171, 20400)
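
The vocabulary is learned from the training messages only, so new messages should be mapped with `transform` rather than `fit_transform` to stay in the same feature space. Below is a minimal sketch on a toy corpus; the sentences and the names `vectorizer_toy`, `X_train`, `X_test` are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (assumption): stand-ins for newsgroups_train.data / newsgroups_test.data
train_docs = ["my mac will not boot",
              "the x server crashed on startup",
              "mac hardware upgrade"]
test_docs = ["x server on a mac", "completely unseen words"]

vectorizer_toy = TfidfVectorizer()
X_train = vectorizer_toy.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_test = vectorizer_toy.transform(test_docs)        # reuses them, no refitting

# Both matrices have one column per vocabulary term learned on the training set
print(X_train.shape, X_test.shape)
# Words never seen during fit are ignored, so the second test row has no nonzero entry
print(X_test[1].nnz)
```

With the default settings, tokens of a single character (such as "x") are dropped and each row is scaled to unit Euclidean norm.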


In [44]:
print(vectors)  # Sparse matrix: each line shows (message index, vocabulary term index) followed by the nonzero TF-IDF weight; all other entries are 0

  (0, 9381)	0.0833319202434346
  (0, 9537)	0.051488753790363266
  (0, 7097)	0.0716017121921997
  (0, 19198)	0.03825757019634386
  (0, 13923)	0.0612567924476602
  (0, 19520)	0.0612567924476602
  (0, 13751)	0.06779317109369336
  (0, 11389)	0.06258699323079087
  (0, 6526)	0.06005984630463308
  (0, 4094)	0.040173990576588194
  (0, 13321)	0.04790996430819856
  (0, 13196)	0.04854513021356325
  (0, 18501)	0.06235389442108352
  (0, 13920)	0.05395063736099927
  (0, 10224)	0.036000680480091043
  (0, 19648)	0.15444542259960156
  (0, 15199)	0.14639735648225874
  (0, 5916)	0.06893303130389038
  (0, 4700)	0.027787419148156887
  (0, 7025)	0.08531073486179562
  (0, 8529)	0.06490899824521898
  (0, 12368)	0.03951762719939179
  (0, 178)	0.07319867824112937
  (0, 4997)	0.0716017121921997
  (0, 4864)	0.059503491097782954
  :	:
  (1170, 17575)	0.21148205930503944
  (1170, 17563)	0.015929014052134184
  (1170, 13194)	0.04351509358870771
  (1170, 10309)	0.1045700958833079
  (1170, 17572)	0.15091995023585136
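
To make the printed format above easier to read, here is a small sketch that builds a sparse matrix by hand and lists its nonzero entries in the same `(row, column) value` form. The matrix values are arbitrary and only serve as an illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3 x 4 matrix (assumption): rows play the role of messages, columns of vocabulary terms
dense = np.array([[0.0, 0.5, 0.0, 0.2],
                  [0.0, 0.0, 0.0, 0.0],
                  [0.7, 0.0, 0.0, 0.0]])
m = csr_matrix(dense)

# nonzero() returns the row and column indices of the stored nonzero entries
rows, cols = m.nonzero()
for r, c in zip(rows, cols):
    print((int(r), int(c)), m[r, c])

# Fraction of entries that are nonzero
density = m.nnz / (m.shape[0] * m.shape[1])
print(density)
```

The same `nnz` and `shape` attributes are available on the `vectors` matrix returned by `TfidfVectorizer`, which is a convenient way to see how sparse the TF-IDF representation is.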
  

# S3VM