## Text Classification

Text classification problems come in several types:

- Binary: two mutually exclusive categories (e.g. spam detection)
- Multiclass: more than two mutually exclusive categories (e.g. language detection)
- Multilabel: categories that are not mutually exclusive (e.g. genres of movies or TV shows)
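
To make the three cases concrete, here is a minimal sketch of what the labels might look like in each setting. The sample labels and the `to_indicator_rows` helper are made up for illustration; they are not part of any library.

```python
# Hypothetical labels, one entry per document.
binary_labels = ["spam", "ham", "spam"]        # exactly one of 2 classes
multiclass_labels = ["EN", "SP", "FR"]         # exactly one of more than 2 classes
multilabel_labels = [{"comedy", "romance"},    # any subset of the classes
                     {"thriller"},
                     set()]


def to_indicator_rows(label_sets):
    """Encode multilabel targets as 0/1 rows, one column per class."""
    classes = sorted(set().union(*label_sets))
    rows = [[int(c in labels) for c in classes] for labels in label_sets]
    return classes, rows


classes, rows = to_indicator_rows(multilabel_labels)
print(classes)
for row in rows:
    print(row)
```

In the binary and multiclass cases each document carries a single label, while in the multilabel case a document can switch on several columns at once, or none.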

### Binary text classification problem



In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [2]:
# Train and test data set

train_data = ['Football: a great sport', 
              'The referee has been very bad this season', 
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK', 
              'Brexit means Brexit', 
              'The parlament wants to create new legislation',
              'I so want to travel the world']

train_labels = ["Sports","Sports","Sports","Sports", 
                "Non Sports", "Non Sports", "Non Sports", "Non Sports"]

test_data = ['Swimming is a great sport', 
             'A lot of policy changes will happen after Brexit', 
             'The table tenis team will travel to the UK soon for the European Championship']
test_labels = ["Sports", "Non Sports", "Sports"]

In [3]:
# Representation of data using Tf-IDF
vectorizer = TfidfVectorizer()
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)

In [4]:
# Train the classifier given the training data
classifier = LinearSVC()
classifier.fit(vectorized_train_data, train_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [5]:
# Predict the labels for the test documents 
print(classifier.predict(vectorized_test_data))

['Sports' 'Non Sports' 'Non Sports']
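
A quick way to check the result is to compare the predictions with `test_labels`. The sketch below hard-codes the predictions printed above so that it runs on its own; `accuracy` is a hypothetical helper, and in the notebook you would pass `classifier.predict(vectorized_test_data)` instead.

```python
test_labels = ["Sports", "Non Sports", "Sports"]
# Predictions copied from the output above.
predicted = ["Sports", "Non Sports", "Non Sports"]


def accuracy(expected, predicted):
    """Fraction of documents whose predicted label matches the expected one."""
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


# Indices of the test documents that were misclassified.
misses = [i for i, (e, p) in enumerate(zip(test_labels, predicted)) if e != p]
print(accuracy(test_labels, predicted))
print(misses)
```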


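### Nice. We built our text classifier :)

The third test document is misclassified, though. Some typical issues with this kind of classifier:

- Matching problems
- Cases never seen before
- "Spurious" correlations and bias ("car" appears only in the +ve category)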
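
The "cases never seen before" issue can be checked directly. The sketch below is a rough approximation in plain Python: it assumes the default `TfidfVectorizer` tokenization (lowercasing, tokens of two or more word characters), and the `tokenize` and `unseen_tokens` helpers are made up for illustration.

```python
import re

train_data = ['Football: a great sport',
              'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK',
              'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']

test_data = ['Swimming is a great sport',
             'A lot of policy changes will happen after Brexit',
             'The table tenis team will travel to the UK soon for the European Championship']


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Every token that occurs in at least one training document.
vocabulary = {tok for doc in train_data for tok in tokenize(doc)}


def unseen_tokens(doc):
    """Tokens of doc that never occurred in the training data."""
    return [tok for tok in tokenize(doc) if tok not in vocabulary]


for doc in test_data:
    print(unseen_tokens(doc))
```

Tokens that are missing from the training vocabulary are silently dropped by `transform`, so the classifier never sees words such as "swimming" or "championship".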

In [10]:
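from pprint import pprint  # pretty-print nested structures

def feature_values(doc, representer):
    """Return (feature name, tf-idf weight) pairs for the non-zero features of doc."""
    doc_rep = representer.transform([doc])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features = representer.get_feature_names_out()
    return [(str(features[index]), float(doc_rep[0, index])) for index in doc_rep.nonzero()[1]]

pprint([feature_values(doc, vectorizer) for doc in test_data])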

[[('sport', 0.57735026918962584),
  ('is', 0.57735026918962584),
  ('great', 0.57735026918962584)],
 [('brexit', 1.0)],
 [('uk', 0.34666892278432909),
  ('travel', 0.34666892278432909),
  ('to', 0.29053561299308733),
  ('the', 0.6594480187891556),
  ('tenis', 0.34666892278432909),
  ('team', 0.34666892278432909)]]
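
To see where these numbers come from, the weights can be recomputed by hand. This is a sketch in plain Python, assuming the `TfidfVectorizer` defaults: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization. The `tokenize` and `tfidf` helpers are made up for illustration.

```python
import math
import re
from collections import Counter

train_data = ['Football: a great sport',
              'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK',
              'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


n_docs = len(train_data)
# Document frequency: number of training documents containing each token.
doc_freq = Counter(tok for doc in train_data for tok in set(tokenize(doc)))


def tfidf(doc):
    """L2-normalized tf-idf weights for the tokens of doc seen in training."""
    counts = Counter(tok for tok in tokenize(doc) if tok in doc_freq)
    raw = {tok: tf * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
           for tok, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:  # no known tokens at all
        return {}
    return {tok: w / norm for tok, w in raw.items()}


weights = tfidf('The table tenis team will travel to the UK soon for the European Championship')
for tok, w in sorted(weights.items()):
    print(tok, round(w, 4))
```

"the" occurs three times in the third test document, which is why it ends up with the largest weight even though its idf is the lowest of the six features.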


### Let's try removing stop words

In [14]:
from nltk.corpus import stopwords

# Load the list of English stop words from NLTK
# (needs a one-time nltk.download("stopwords"))
stop_words = stopwords.words("english")

# Represent the data again, this time dropping stop words
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)

# Create the LinearSVC classifier
classifier = LinearSVC()

# Fit the classifier on the vectorized training data and its labels
classifier.fit(vectorized_train_data, train_labels)

# Print the predictions; we expect Sports, Non Sports, Sports
print(classifier.predict(vectorized_test_data))

['Sports' 'Non Sports' 'Sports']
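
One plausible reason the third prediction flips: in this tiny training set, function words such as "the" and "to" happen to occur mostly in the `Non Sports` documents, so they act as spurious evidence for that class. The sketch below counts, per class, how many training documents contain each feature of the third test document. It uses plain Python and approximates the default tokenization; `tokenize` is a made-up helper.

```python
import re
from collections import Counter

train_data = ['Football: a great sport',
              'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tenis',
              'Politics is in decline in the UK',
              'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']

train_labels = ["Sports", "Sports", "Sports", "Sports",
                "Non Sports", "Non Sports", "Non Sports", "Non Sports"]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# For each token, count how many training documents of each class contain it.
class_doc_freq = {}
for doc, label in zip(train_data, train_labels):
    for tok in set(tokenize(doc)):
        class_doc_freq.setdefault(tok, Counter())[label] += 1

for tok in ["the", "to", "tenis", "team", "travel", "uk"]:
    print(tok, dict(class_doc_freq[tok]))
```

"the" appears in three `Non Sports` documents and only one `Sports` document, and "to" appears only in `Non Sports` documents. Both are in NLTK's English stop list, so removing them leaves the decision to the content words.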


### Ok, cool.

### Multi-Class Classification Challenge

Here, let's address a multi-class problem: detecting the language of a sentence, choosing among three mutually exclusive languages (English, Spanish, and French). Let's assume that documents can only be written in one of these three languages.

So, let's get on and create some artificial sample text...

In [15]:
train_data = ['PyCon es una gran conferencia', 
              'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
              'This is a great conference with a lot of amazing talks', 
              'AI will dominate the world in the near future',
              'Dix chiffres por resumer le feuilleton de la loi travail']
train_labels = ["SP", "SP", "EN", "EN", "FR"]
test_data = ['Estoy preparandome para dominar las olimpiadas', 
             'Me gustaria mucho aprender el lenguage de programacion Scala',
             'Machine Learning is amazing',
             'Hola a todos']
test_labels = ["SP", "SP", "EN", "SP"]

# Representation
vectorizer = TfidfVectorizer()
vectorized_train_data = vectorizer.fit_transform(train_data)
vectorized_test_data = vectorizer.transform(test_data)

# Training
classifier = LinearSVC()
classifier.fit(vectorized_train_data, train_labels)

# Predicting
predictions = classifier.predict(vectorized_test_data)
pprint(predictions)

array(['SP', 'SP', 'EN', 'EN'],
      dtype='<U2')


### So, what happened above?
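
A likely explanation for the last, wrong prediction is that "Hola a todos" shares no token with the training data, so its tf-idf vector is all zeros. The sketch below checks this in plain Python, again assuming the default `TfidfVectorizer` tokenization (which also drops the one-letter word "a"); `tokenize` and `known_tokens` are made-up helpers.

```python
import re

train_data = ['PyCon es una gran conferencia',
              'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
              'This is a great conference with a lot of amazing talks',
              'AI will dominate the world in the near future',
              'Dix chiffres por resumer le feuilleton de la loi travail']

test_data = ['Estoy preparandome para dominar las olimpiadas',
             'Me gustaria mucho aprender el lenguage de programacion Scala',
             'Machine Learning is amazing',
             'Hola a todos']


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Every token that occurs in at least one training document.
vocabulary = {tok for doc in train_data for tok in tokenize(doc)}


def known_tokens(doc):
    """Tokens of doc that also occur in the training data."""
    return [tok for tok in tokenize(doc) if tok in vocabulary]


for doc in test_data:
    print(known_tokens(doc))
```

The first three sentences each keep two known tokens, while the last one keeps none. For an all-zero vector, a linear model's decision scores reduce to its intercept terms, so the predicted label says nothing about the sentence itself.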