# Semi-Supervised Learning Method
### By Taiki Papandreou

## Quick Recap
A semi-supervised learning model is trained on both labeled and unlabeled data. Since we manually labeled our data in week 2, we can implement this technique in week 3. <br>
The goal of this model is to get a better result than lbl2vec. <br>
Source: https://machinelearningmastery.com/semi-supervised-learning-with-label-propagation/


In [14]:
# imports (unused ones removed)
import itertools

import nltk
import numpy as np
import pandas as pd
from nltk.stem.wordnet import WordNetLemmatizer
from numpy import concatenate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation


# Data

In [2]:
df = pd.read_csv('0-10000-labeled.csv')
df[:5]

Unnamed: 0,article,highlights,id,label
0,"LONDON, England (Reuters) -- Harry Potter star...",Harry Potter star Daniel Radcliffe gets £20M f...,42c027e4ff9730fbb3de84c1af0d2c506e41c3e4,0.0
1,Editor's note: In our Behind the Scenes series...,Mentally ill inmates in Miami are housed on th...,ee8871b15c50d0db17b0179a6d2beab35065f1e9,0.0
2,"MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...","NEW: ""I thought I was going to die,"" driver sa...",06352019a19ae31e527f37f7571c6dd7f0c5da37,1.0
3,WASHINGTON (CNN) -- Doctors removed five small...,"Five small polyps found during procedure; ""non...",24521a2abb2e1f5e34e6824e0f9e56904a2b0e88,0.0
4,(CNN) -- The National Football League has ind...,"NEW: NFL chief, Atlanta Falcons owner critical...",7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a,0.0


In [3]:
# Inspect some articles with label
check0 = df[['highlights', 'label']]
n = 3
check1 = check0.sample(n)
pd.options.display.max_colwidth = 300
print(check1.to_string(index=False, header=False))

# It seems label 0 is working


                                                                                              Retired Adm. Dennis Blair confirmed by unanimous consent .\nConfirmation comes after previous director Michael McConnell resigned .\nEarlier this month, President Obama nominated Blair as chief of intelligence . 0.0
                     FBI looking for financial impropriety after man's clients talked about money loss .\nPolice say William Parente killed wife, two daughters in hotel room .\nFamily ID'd as William and Betty Parente, Stephanie, 19, Catherine, 11 .\nPolice say Parente, an attorney, fatally cut himself . 0.0
8-year-old girl sexually assaulted by fellow Liberia natives, police say .\nDuring Liberia's civil war, rape was used as a weapon by soldiers .\nU.N. report: 60 to 70 percent of Liberian women were assault victims .\nJohnson-Sirleaf, first elected female leader in Africa, makes stopping rape a priority . 0.0


# Data preparation

In [4]:
# define dataset
# We only use highlights in this notebook since it gives enough information on articles
df1 = check0
print(df1.shape)
print(df1.columns)

(10000, 2)
Index(['highlights', 'label'], dtype='object')


In [5]:
# merge with 10000 unlabeled data
# load that data
df2 = pd.read_csv('277113-287113-unlabeled.csv')
df2 = df2[['highlights', 'label']]
df2[:2]

Unnamed: 0,highlights,label
0,"Skier Raphael Beghi captured fall via a helmet-mounted camera .\nAs he sped down the hill, he lost control of his skis and they flew off .\nHe quickly crashed to the ground and skidded to a stop in Serre Chevalier .",
1,Two women were arrested for 'intruding' during the Australian Open Final .\nThey ran onto the court to protest the poor living conditions and treatment of refugees detained on Manus Island immigration detention facility .\nNovak Djokovic went on to triumph over Andy Murray in the Men's Final .\n...,


In [6]:
# merge them together
frames = [df1, df2]
data = pd.concat(frames, ignore_index=True)
# The result should have the columns ['highlights', 'label'] and 20k records
print(data.shape)
print(data.columns)

(20000, 2)
Index(['highlights', 'label'], dtype='object')


In [7]:
%%time
# We need to have a vector representation of highlights
# In this project we only look at words and not grammar
# So using other text representation might improve the performance

###########################################################################

# We will only look at nouns since those words give semantics
# helper to test whether a POS tag marks a noun (NN, NNS, NNP, NNPS)
def is_noun(pos):
    return pos[:2] == 'NN'
# lemmatizer for later
Lem = WordNetLemmatizer()
for i in range(len(data)):
    # extract only nouns from highlights
    lines = data['highlights'][i]
    # tokenize highlights
    tokenized = nltk.word_tokenize(lines)
    nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 
    # make them lowercase
    nouns = [noun.lower() for noun in nouns]
    # lemmatization to reduce plurals etc. to their base form
    nouns = [Lem.lemmatize(noun) for noun in nouns]
    # update 
    data.at[i,'highlights']= nouns

CPU times: user 1min 18s, sys: 1.45 s, total: 1min 19s
Wall time: 1min 20s


In [8]:
# check how many unique words we have
words = list(itertools.chain(data['highlights']))
merged = list(itertools.chain(*words))
# remove duplicates to get the (noun-only) vocabulary of our data
vocab = set(merged)
print("This dataset contains ", len(vocab), "nouns.")

This dataset contains  34725 nouns.


In [9]:
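# join each noun list back into one space-separated string for the vectorizer
sentences = [' '.join(nouns) for nouns in data['highlights']]
# norm=None keeps the raw tf-idf weights (no L2 normalization)
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)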
sentence_vectors = vectorizer.fit_transform(sentences)
print(sentence_vectors.shape)

(20000, 31676)
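
To make the weights produced by this vectorizer easier to interpret, here is a minimal sketch on a hypothetical three-document corpus (not taken from the article data). It assumes scikit-learn's documented formula for `smooth_idf=False`, namely `idf = ln(n / df) + 1`, and no normalization, so each weight is simply `tf * idf`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
docs = ["soldier battle soldier", "battle street", "street plastic wire"]

vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
weights = vectorizer.fit_transform(docs).toarray()

# Recompute the weight of "soldier" in the first document by hand
n_docs = len(docs)
df_soldier = sum("soldier" in doc.split() for doc in docs)  # document frequency
tf_soldier = docs[0].split().count("soldier")               # term frequency
manual = tf_soldier * (np.log(n_docs / df_soldier) + 1)

col = vectorizer.vocabulary_["soldier"]
print(weights[0, col], manual)
```

Both printed numbers should agree, which shows that without normalization rare or repeated nouns get large weights.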


In [10]:
%%time
X = [sentence_vectors[i].toarray()[0] for i in range(len(data))]

CPU times: user 2.99 s, sys: 709 ms, total: 3.7 s
Wall time: 3.85 s


In [11]:
X = np.array(X)
print(type(X))
print(np.unique(X[0]))

<class 'numpy.ndarray'>
[ 0.          4.58632287  4.75930192  4.82126864  4.9633163   5.63562939
  5.92675381  6.0359531   6.23065872  6.47267075  6.57275421  6.99146455
  7.46950035  8.19543735 15.00458034 17.91515481]


In [12]:
# LabelPropagation cannot process NaN and expects unlabeled samples to be marked with -1
print(data['label'].unique())
data['label'] = data['label'].replace(np.nan, -1)
print(data['label'].unique())
y = np.array(list(data['label']))

[ 0.  1. nan]
[ 0.  1. -1.]


# Label Propagation Algorithm

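Label Propagation is a semi-supervised learning algorithm. It builds a similarity graph over all samples and spreads the known labels to the unlabeled samples along that graph. In scikit-learn, unlabeled samples are marked with the label `-1`.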
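
As a small illustration of the idea, the sketch below runs `LabelPropagation` on six hypothetical 2-D points forming two clusters, with only one labeled point per cluster. The data, the variable names, and the choice `gamma=1.0` are assumptions made for this toy example, not settings used in this notebook.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Two well-separated clusters of hypothetical 2-D points
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
# One labeled point per cluster; -1 marks the unlabeled points
y_toy = np.array([0, -1, -1, 1, -1, -1])

toy_model = LabelPropagation(kernel="rbf", gamma=1.0)
toy_model.fit(X_toy, y_toy)

# transduction_ holds the label assigned to every training point
print(toy_model.transduction_)
```

Each unlabeled point should take the label of the labeled point in its own cluster, because the RBF similarity between the two clusters is practically zero.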


In [15]:
%%time
# Now that the data is in a suitable format we can run the label propagation algorithm
# split into train and test (y still contains -1 for unlabeled rows, so both halves include them)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into two halves; both keep their original labels (nothing is masked here)
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# recombine the two halves into the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
y_train_mixed = concatenate((y_train_lab, y_test_unlab))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Accuracy: 47.660


  probabilities /= normalizer


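At 47.66% accuracy it performs slightly better than lbl2vec. Note that `y_test` still contains the unlabeled rows (label `-1`), which `LabelPropagation` never predicts, so this score is capped at roughly 50% and understates the accuracy on the labeled articles.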
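
Since half of the rows carry the placeholder label `-1`, one way to measure accuracy on labeled articles only is to hold out the test set from the labeled rows and use every unlabeled row for training. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, and the two Gaussian blobs are hypothetical stand-ins for `X` and `y`, and the printed score says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(0)
# Hypothetical stand-in data: two Gaussian blobs, about half of the labels hidden as -1
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)
y_demo[rng.random(200) < 0.5] = -1

labeled = y_demo != -1
# Hold out a test set from the labeled rows only
X_lab_train, X_lab_test, y_lab_train, y_lab_test = train_test_split(
    X_demo[labeled], y_demo[labeled], test_size=0.5, random_state=1,
    stratify=y_demo[labeled])

# Train on the labeled training rows plus all unlabeled rows
X_fit = np.concatenate((X_lab_train, X_demo[~labeled]))
y_fit = np.concatenate((y_lab_train, y_demo[~labeled]))

demo_model = LabelPropagation()
demo_model.fit(X_fit, y_fit)
demo_score = accuracy_score(y_lab_test, demo_model.predict(X_lab_test))
print('Accuracy on labeled test rows: %.3f' % (demo_score * 100))
```

With this setup every test label is a real class, so the score is directly comparable to other classifiers evaluated on the labeled articles.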

In [22]:
data['highlights'][42]

['soldier',
 'battle',
 'islamic',
 'insurgent',
 'people',
 'body',
 'hand',
 'foot',
 'wire',
 'sheet',
 'plastic',
 'incident',
 'dragging',
 'u.s.',
 'soldier',
 'street',
 'mogadishu',
 'washington',
 'somalia',
 'haven',
 'terrorist']

## How to improve the accuracy
- A better labeled dataset (we did not define a list of criteria for labeling, so label 1 is subjective and vague)
- A better text representation (Word2Vec?)
- Different models
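
As one possible direction for the "different models" item, the sketch below wraps a logistic regression in scikit-learn's `SelfTrainingClassifier`, which also uses `-1` for unlabeled rows. The toy texts, the labels, and the `threshold=0.6` setting are assumptions for illustration only; this has not been run on the article data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Hypothetical toy highlights; label -1 marks unlabeled rows
texts = [
    "soldier battle insurgent attack",
    "team win football match",
    "soldier attack street insurgent",
    "football team coach match",
    "battle insurgent soldier city",
    "coach team season win",
]
labels = np.array([1, 0, -1, -1, 1, 0])

toy_vectorizer = TfidfVectorizer()  # L2-normalized by default
# Dense array only because the toy corpus is tiny
features = toy_vectorizer.fit_transform(texts).toarray()

# Self-training: fit on labeled rows, then pseudo-label confident unlabeled rows
self_training = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
self_training.fit(features, labels)

new_texts = ["soldier insurgent battle", "team win season"]
predictions = self_training.predict(toy_vectorizer.transform(new_texts).toarray())
print(predictions)
```

Unlike label propagation with an RBF kernel, a linear classifier does not depend on pairwise distances between very high-dimensional vectors, which may make it a reasonable baseline to compare against.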
