# "Machine learning" with text (in Scikit-learn)

In [1]:
import unicodedata  # handles carons, acute accents and other unusual characters
import re  # text processing (regular expressions)
import csv  # comma-separated values
import json  # parses a JSON file into Python data
import numpy as np
unimported = """
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
"""

In [2]:
# for Python 2 users
from __future__ import print_function

## Introduction to supervised learning in scikit-learn

**From <a href="https://en.wikipedia.org/wiki/Supervised_learning">Wikipedia</a>:**<br>
**Supervised learning** is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value.

**Note:** We will consider a classification task, i.e., samples belong to two or more classes that we want to predict.

Let's load the data. If we had scikit-learn installed, we would do this:

In [3]:
_ = """
# Load the iris dataset.
from sklearn.datasets import load_iris
iris = load_iris()
"""

However, we have the iris dataset in the data folder. The columns are: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)', 'label']

In [9]:
with open('omg.json') as f:
    iris = np.array(json.load(f))

In [10]:
# We will get rid of class 2 for convenience.
iris = iris[iris[:,-1]!=2]

In [12]:
# Let's examine the shape of the data (the last column is the label).
print(iris.shape)
n_features = iris.shape[1] - 1
n_features

(100, 5)


4

Let's look at the data.

In [14]:
iris

array([[ 5.1,  3.5,  1.4,  0.2,  0. ],
       [ 4.9,  3. ,  1.4,  0.2,  0. ],
       [ 4.7,  3.2,  1.3,  0.2,  0. ],
       [ 4.6,  3.1,  1.5,  0.2,  0. ],
       [ 5. ,  3.6,  1.4,  0.2,  0. ],
       [ 5.4,  3.9,  1.7,  0.4,  0. ],
       [ 4.6,  3.4,  1.4,  0.3,  0. ],
       [ 5. ,  3.4,  1.5,  0.2,  0. ],
       [ 4.4,  2.9,  1.4,  0.2,  0. ],
       [ 4.9,  3.1,  1.5,  0.1,  0. ],
       [ 5.4,  3.7,  1.5,  0.2,  0. ],
       [ 4.8,  3.4,  1.6,  0.2,  0. ],
       [ 4.8,  3. ,  1.4,  0.1,  0. ],
       [ 4.3,  3. ,  1.1,  0.1,  0. ],
       [ 5.8,  4. ,  1.2,  0.2,  0. ],
       [ 5.7,  4.4,  1.5,  0.4,  0. ],
       [ 5.4,  3.9,  1.3,  0.4,  0. ],
       [ 5.1,  3.5,  1.4,  0.3,  0. ],
       [ 5.7,  3.8,  1.7,  0.3,  0. ],
       [ 5.1,  3.8,  1.5,  0.3,  0. ],
       [ 5.4,  3.4,  1.7,  0.2,  0. ],
       [ 5.1,  3.7,  1.5,  0.4,  0. ],
       [ 4.6,  3.6,  1. ,  0.2,  0. ],
       [ 5.1,  3.3,  1.7,  0.5,  0. ],
       [ 4.8,  3.4,  1.9,  0.2,  0. ],
       [ 5. ,  3. ,  1.6,

Let's examine the features more closely.

In [63]:
# Compute basic statistics for the data: count each label, and compute the mean and std of each feature.

""" <<CODE>>"""
ind = np.bincount( iris[:,-1].astype(int) ) # number of samples per class (zeros, ones)
np.mean(iris[0:ind[0],0]) # mean of the first column for class 0 (rows are sorted by class)
print( iris.mean(axis=0) ) # means of all columns
print( iris.std(axis=0) ) # standard deviations

[ 5.471  3.094  2.862  0.785  0.5  ]
[ 0.63848179  0.47367077  1.44130358  0.5634492   0.5       ]


## Now, let's do some classification
We split the data into a training part and a test part.

In [256]:
np.random.seed(4247)
np.random.shuffle(iris)
train_count = int(iris.shape[0]*0.9)
train = iris[:train_count]
train_x = train[:,:-1]
train_y = train[:,-1]
test = iris[train_count:]
test_x = test[:,:-1]
test_y = test[:,-1]

[42 38]
[ 5.4525   3.10625  2.8075   0.7675 ]
[ 5.545  3.045  3.08   0.855]


Do we have a good test set (are the labels balanced)?

In [182]:
""" <<CODE>> """
print( np.bincount( train[:,-1].astype(int) ) ) 
print( train_x.mean(axis=0) )
print( test_x.mean(axis=0) )

[2 3]
[ 5.26  3.04  2.84  0.82]
[ 5.48210526  3.09684211  2.86315789  0.78315789]


Now we want to train a model that learns to predict the last column from the first four.
We will evaluate it on our test set.

In [76]:
_ = """
# Init logistic regression model with default params.
clf = LogisticRegression()

# Fit the model. 
clf.fit(train_x, train_y)
predicted = clf.predict(test_x)
"""

How could we evaluate the performance of our model? What do we care about?

In [193]:
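def precision(predicted_y, true_y):
    # Fraction of correct predictions (strictly speaking this is accuracy, not precision).
    return float((predicted_y == true_y).sum())/predicted_y.size

def L2(predicted_y, true_y):
    # Half of the sum of squared errors.
    return 0.5 * ((predicted_y-true_y)**2).sum()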


In [215]:
a=-3
np.absolute(a)

3

But first, let's set up some benchmarks. What happens if we always predict the same value?

In [258]:
# train a model that predicts only one value and evaluate its performance on the test set
"""<<CODE>>"""

# Nearest-neighbour classifier that compares only the first feature.
# The `neigbours` argument is not used yet.
def klasifikator(new, train, neigbours):
    length = train.shape[0]
    dist = np.zeros((length))
    for i in range(length):
        dist[i] = np.absolute( new[0] - train[i,0] )
    return train[dist.argmin(),-1]

In [259]:
length = test.shape[0]
chyba = 0
for i in range(length):
    chyba += L2(klasifikator(test_x[i], train, 1), test_y[i])
print(chyba)

1.0
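
The constant-prediction benchmark asked about above can be sketched as follows. This is only an illustration: the toy label arrays stand in for `train_y` and `test_y`, and the helper names `fit_majority_class` and `accuracy` are assumptions, not part of the notebook.

```python
import numpy as np

def fit_majority_class(train_y):
    """Return the label that occurs most often in the training labels."""
    counts = np.bincount(train_y.astype(int))
    return int(counts.argmax())

def accuracy(predicted_y, true_y):
    """Fraction of predictions that match the true labels."""
    return float((predicted_y == true_y).sum()) / predicted_y.size

# Toy labels standing in for train_y / test_y (assumed, not the real split).
toy_train_y = np.array([0., 1., 1., 0., 1., 1.])
toy_test_y = np.array([1., 0., 1., 1.])

majority = fit_majority_class(toy_train_y)
# The "model" ignores the features and always predicts the majority label.
baseline_pred = np.full(toy_test_y.shape, majority, dtype=float)
baseline_acc = accuracy(baseline_pred, toy_test_y)
print(majority, baseline_acc)
```

Any real classifier should beat this number; on a balanced two-class set the baseline sits near 0.5.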


## Nearest neighbour classifier
Classify each sample by the label of its nearest example in the training set. Measure the distance between examples with the Euclidean distance. Try cosine distance as well.

In [162]:
"""<<CODE>>"""


0.0
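
One possible sketch of the nearest-neighbour exercise, using all features and a pluggable distance function. The toy arrays and the function names (`euclidean_distance`, `cosine_distance`, `nearest_neighbour`) are assumptions for illustration; with the real data you would pass `train_x`, `train_y` and rows of `test_x`.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between each row of a and the vector b."""
    return np.sqrt(((a - b) ** 2).sum(axis=-1))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes no all-zero vectors."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - num / den

def nearest_neighbour(new_x, train_x, train_y, distance=euclidean_distance):
    """Return the label of the training example closest to new_x."""
    dist = distance(train_x, new_x)  # broadcasting: (n, d) against (d,)
    return train_y[dist.argmin()]

# Toy data (assumed): two examples per class, four features each.
toy_train_x = np.array([[5.0, 3.5, 1.4, 0.2],
                        [4.9, 3.0, 1.4, 0.2],
                        [6.4, 3.2, 4.5, 1.5],
                        [6.9, 3.1, 4.9, 1.5]])
toy_train_y = np.array([0., 0., 1., 1.])
toy_new = np.array([6.0, 3.0, 4.6, 1.4])

pred_euclid = nearest_neighbour(toy_new, toy_train_x, toy_train_y)
pred_cosine = nearest_neighbour(toy_new, toy_train_x, toy_train_y, cosine_distance)
print(pred_euclid, pred_cosine)
```

Cosine distance compares only the direction of the feature vectors, so two flowers with the same proportions but different sizes count as identical.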


Try considering the 10 nearest samples. Is it better?
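
A minimal sketch of the k-nearest-neighbours variant with a majority vote. The synthetic clusters below are an assumption that only loosely mimics the two iris classes, and `knn_predict` is a hypothetical helper name.

```python
import numpy as np

def knn_predict(new_x, train_x, train_y, k=10):
    """Majority vote among the k training examples closest to new_x."""
    dist = np.sqrt(((train_x - new_x) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:k]          # indices of the k smallest distances
    votes = np.bincount(train_y[nearest].astype(int))
    return votes.argmax()                   # ties go to the lower label

# Synthetic stand-in for the training set (assumed cluster centres).
rng = np.random.RandomState(0)
toy_train_x = np.vstack([rng.normal([5.0, 3.4, 1.5, 0.2], 0.2, size=(20, 4)),
                         rng.normal([5.9, 2.8, 4.3, 1.3], 0.2, size=(20, 4))])
toy_train_y = np.concatenate([np.zeros(20), np.ones(20)])

pred_a = knn_predict(np.array([5.0, 3.4, 1.5, 0.2]), toy_train_x, toy_train_y, k=10)
pred_b = knn_predict(np.array([6.0, 2.8, 4.4, 1.4]), toy_train_x, toy_train_y, k=10)
print(pred_a, pred_b)
```

A larger `k` smooths out the effect of single noisy examples, at the price of blurring the boundary between classes.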


In [111]:
"""<<CODE>>"""

'<<CODE>>'

Any potential problems?  Are features considered equally?

### Logistic regression
Our model will be a simple logistic regression.

We want to predict $\hat{y}$ with the following formula: $\hat{y} = h(wx + b)$, where $h$ is some nonlinearity, often the sigmoid $h(x) = \frac{1}{1+e^{-x}}$.

We want to find $w$ and $b$ that minimize the loss $L = \frac{1}{2}(\hat{y}-y)^2$.

Iteratively (for a number of steps) update $b$ and $w$: $b \leftarrow b - \alpha \frac{\partial L}{\partial b}$, $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$, where $\alpha$ is the learning rate. We can do this for one example at a time, or for all examples at once.
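
For reference, the gradients needed in the update follow from the chain rule. This sketch assumes a single example, the sigmoid $h(z) = \frac{1}{1+e^{-z}}$ with derivative $h'(z) = h(z)(1-h(z))$, and writes $z = wx + b$ (the symbol $z$ is introduced here only for convenience):

$$\frac{\partial L}{\partial w} = (\hat{y} - y)\,\hat{y}\,(1-\hat{y})\,x, \qquad \frac{\partial L}{\partial b} = (\hat{y} - y)\,\hat{y}\,(1-\hat{y})$$

For a whole batch, the per-example gradients are summed (or averaged).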

In [None]:
# set some alpha
# initialize b and w to some small, positive values.

for epoch in range(100):
    # compute loss
    # update w and b
    # print loss
    pass
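
One way the skeleton above could be filled in, as a sketch rather than a reference solution. It uses the squared loss with a sigmoid output and batch updates over all examples; the synthetic data, the value of `alpha` and the helper `sigmoid` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic stand-in for train_x / train_y (assumed cluster centres).
rng = np.random.RandomState(0)
toy_x = np.vstack([rng.normal([5.0, 3.4, 1.5, 0.2], 0.2, size=(20, 4)),
                   rng.normal([5.9, 2.8, 4.3, 1.3], 0.2, size=(20, 4))])
toy_y = np.concatenate([np.zeros(20), np.ones(20)])

alpha = 0.002                          # learning rate (assumed value)
w = np.full(toy_x.shape[1], 0.01)      # small, positive initial weights
b = 0.01

losses = []
for epoch in range(100):
    y_hat = sigmoid(toy_x.dot(w) + b)
    loss = 0.5 * ((y_hat - toy_y) ** 2).sum()
    losses.append(loss)
    # dL/dz for every example: (y_hat - y) * h'(z), with h'(z) = y_hat * (1 - y_hat)
    delta = (y_hat - toy_y) * y_hat * (1.0 - y_hat)
    w -= alpha * toy_x.T.dot(delta)    # gradient summed over all examples
    b -= alpha * delta.sum()

predicted = (sigmoid(toy_x.dot(w) + b) > 0.5).astype(float)
print(losses[0], losses[-1])
```

Thresholding the output at 0.5 turns the continuous prediction into a class label; the learning rate is kept small here so that the summed gradient does not overshoot.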

### Data normalization
Remember the question: are all features treated equally?
Set the feature means to $0$ and the standard deviations to $1$, then run the methods above again.
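
A minimal sketch of this standardization. The statistics are computed on the training part only and reused for the test part; the toy arrays and the helper name `standardize` are assumptions, and the code assumes no feature is constant (otherwise the standard deviation is zero).

```python
import numpy as np

def standardize(train_x, test_x):
    """Scale both sets with the mean and std of the training set."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    return (train_x - mean) / std, (test_x - mean) / std

# Toy data (assumed): two features on very different scales.
toy_train_x = np.array([[5.0, 0.2],
                        [6.0, 1.4],
                        [5.5, 0.4],
                        [6.5, 1.5]])
toy_test_x = np.array([[5.75, 0.875]])

train_norm, test_norm = standardize(toy_train_x, toy_test_x)
print(train_norm.mean(axis=0), train_norm.std(axis=0))
```

After scaling, a difference of one unit means "one standard deviation" in every feature, so no single column dominates the distance computation.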


In [None]:
""" <<CODE>> """

## What about text?

In [260]:
text_dataset = ["A coward judges all he sees by what he is.",
                "There are people who need people to need them.",
                "Never's the word God listens for when he needs a laugh."]

### Problem

We cannot feed raw text to logistic regression :(.

### Solution

We need to transform the text into numbers. Each dimension corresponds to one word in the vocabulary. We will ignore words that appear only once.
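
A plain-Python sketch of this bag-of-words encoding, without scikit-learn. The tokenizer (lower-casing, keeping letters and apostrophes) and the helper names `tokenize`, `fit_vocabulary` and `bag_of_words` are assumptions; `text_dataset` is repeated so the block runs on its own.

```python
import re
from collections import Counter
import numpy as np

text_dataset = ["A coward judges all he sees by what he is.",
                "There are people who need people to need them.",
                "Never's the word God listens for when he needs a laugh."]

def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def fit_vocabulary(texts, min_count=2):
    """Sorted list of words occurring at least min_count times overall."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return sorted(word for word, c in counts.items() if c >= min_count)

def bag_of_words(texts, vocabulary):
    """One row per text, one column per vocabulary word, entries are counts."""
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = np.zeros((len(texts), len(vocabulary)), dtype=int)
    for row, text in enumerate(texts):
        for tok in tokenize(text):
            if tok in index:
                matrix[row, index[tok]] += 1
    return matrix

vocabulary = fit_vocabulary(text_dataset)
matrix = bag_of_words(text_dataset, vocabulary)
print(vocabulary)
print(matrix)
```

Note that word order is lost completely: each sentence is reduced to a vector of counts.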

In [115]:
_ = """
# Init CountVectorizer with the default params.
vectorizer = CountVectorizer()
# Learn the vocabulary from the text data.
vectorizer.fit(text_dataset)
"""

In [None]:
# Count each word in the dataset and encode the texts as a "bag of words".
"""<<CODE>>"""

## Alza sentiment analysis

### Loading and preprocessing the dataset

In [262]:
alza = json.load(open('data/alza.json'))
train_count = int(len(alza)*0.8)

In [284]:
alza[]['rating']

1

In [314]:
ratings = [x['rating'] for x in alza]
texts = [x['text'] for x in alza]

In [330]:
# Since the texts in the dataset are written in Czech, let's remove the accents (diacritics) from the text first.
# This may remove some information, but you won't get a Unicode error.
# Note: the function returns bytes, not str.
def remove_accents(s):
    nkfd_form = unicodedata.normalize('NFKD', s)
    ascii_string = nkfd_form.encode('ASCII', 'ignore')
    return ascii_string

In [331]:
[x for x in "pocitac" if x!="c"]

['p', 'o', 'i', 't', 'a']

In [359]:
import string
string.ascii_letters
"a" in string.ascii_letters
"".join([x for x in texts[0] if x in string.ascii_letters + " "])

'Chladi brutalne oproti original chladicu ktory bol pribaleny k jadrovemu AMD FX je tichsi no stale ho je trochu pocut nic strasne to ale neni aj pri vyzsom zatazovani sa drzi pod  stupnov co je uplna parada K instalacii ako som cital ze to moze byt zlozitejsie a tak podobne tak to je uplna primitivnost s instalaciou na socket AM neboli absolutne ziadne problemy hoci obrazkovy navod nestal za moc ak ale mate IQ aspon  verim ze vam bude bohate stacit v skratke pre tych co sa boja a nechcu ist do toho na slepo kratky navod pre socket AM treba dat dole ventilator co sa spravi miernym zapacenim uchytov  srobikmi prichytime uchyty ale nezatahujeme nechame ich na volno aby sa s nimi lahko manipulovalo nasadime na procesor a uchyty nasadime na socket pred nasadenim odporucam do dosky zapojit ventilator aby ste nemali neskor problem sa k tomu konektoru dostat a ventilator polozime na stranu aby zatial nezavadzal  dotiahneme srobiky nasadime ventilator na telo chladica hotovo ja som ho mal namo

In [356]:
texts[0]

'Chladi brutalne, oproti original chladicu, ktory bol pribaleny k 4-jadrovemu AMD FX je tichsi, no stale ho je trochu pocut (nic strasne to ale neni). aj pri vyzsom zatazovani sa drzi pod 45 stupnov, co je uplna parada. K instalacii, ako som cital, ze to moze byt zlozitejsie a tak podobne, tak to je uplna primitivnost, s instalaciou (na socket AM3+) neboli absolutne ziadne problemy, hoci obrazkovy navod nestal za moc, ak ale mate IQ aspon 80, verim, ze vam bude bohate stacit. -v skratke pre tych, co sa boja a nechcu ist do toho na slepo. kratky navod (pre socket AM3+): 1.treba dat dole ventilator, co sa spravi miernym zapacenim uchytov 2. srobikmi prichytime uchyty, ale nezatahujeme, nechame ich na volno, aby sa s nimi lahko manipulovalo 3.nasadime na procesor a uchyty nasadime na socket. (pred nasadenim odporucam do dosky zapojit ventilator, aby ste nemali neskor problem sa k tomu konektoru dostat a ventilator polozime na stranu, aby zatial nezavadzal) 4. dotiahneme srobiky 5.nasadime

Tokenize the text, remove strange characters, and remove stop words.

In [311]:
print(remove_accents("popiči"))

b'popici'


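In [312]:
"""<<CODE>>"""
# remove_accents expects a single string; passing the whole list raises
# "TypeError: normalize() argument 2 must be str, not list", so apply it per text.
text_rem = [remove_accents(t) for t in texts]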
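
The preprocessing steps named above (accent removal, tokenization, stop-word removal) could be combined roughly like this. Everything here is a sketch: `strip_accents` is a variant of `remove_accents` that decodes back to `str`, the stop-word set is a tiny hypothetical list, and the sample sentence is made up in the style of the reviews.

```python
import re
import unicodedata

def strip_accents(s):
    """Like remove_accents, but decoded back to str instead of bytes."""
    nfkd = unicodedata.normalize('NFKD', s)
    return nfkd.encode('ASCII', 'ignore').decode('ASCII')

# Hypothetical, tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "je", "to", "na", "sa", "k", "s", "v"}

def preprocess(text, stop_words=STOP_WORDS):
    """Strip accents, lower-case, keep runs of letters, drop stop words."""
    text = strip_accents(text).lower()
    tokens = re.findall(r"[a-z]+", text)   # digits and punctuation are dropped
    return [t for t in tokens if t not in stop_words]

# Made-up sample sentence (assumption, not taken from the dataset).
sample = "Chladí brutálne, je tichší a drží sa pod 45 stupňov."
tokens = preprocess(sample)
print(tokens)
```

To process the whole dataset, apply `preprocess` to each element of `texts` in a list comprehension.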

### Now it is time to split into train and test sets

In [290]:
train_text = texts[:train_count]
train_rating = ratings[:train_count]
test_text = texts[train_count:]
test_rating = ratings[train_count:]

Vectorize the text. Be aware that the vectorization is part of the prediction pipeline! It must be fitted on the training set only and then applied unchanged to the test set.
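
The point about fitting on the training set only can be illustrated with a small plain-Python sketch. The two helper functions and the toy review texts are assumptions; words that never occurred in training are simply dropped when the test texts are transformed.

```python
from collections import Counter

def fit_vocabulary(train_texts):
    """Map each word seen in the training texts to a column index."""
    words = sorted({w for t in train_texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocabulary):
    """Count vocabulary words in each text; unseen words are ignored."""
    ordered = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for t in texts:
        counts = Counter(w for w in t.lower().split() if w in vocabulary)
        rows.append([counts[w] for w in ordered])
    return rows

# Toy texts (assumed). "hlucny" never occurs in the training part.
toy_train = ["dobre chladi", "zle chladi"]
toy_test = ["dobre dobre hlucny"]

vocab = fit_vocabulary(toy_train)
train_matrix = transform(toy_train, vocab)
test_matrix = transform(toy_test, vocab)
print(vocab)
print(test_matrix)
```

Because the columns are fixed by the training vocabulary, train and test matrices have the same width and can be fed to the same model.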

In [125]:
# transform text to numeric values
"""<<CODE>>"""

'<<CODE>>'

In [127]:
_ = """
# Initialize the CountVectorizer, this time with customized params.
# `stopwords` is assumed to be a DataFrame with a "word" column; it is not loaded in this notebook.
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range=(1,3),
                             stop_words=list(stopwords["word"].values),
                             max_df = 0.5,
                             min_df = 30,
                             tokenizer = lambda x: re.split("[\r\t\n .,;:'\"()?!/]+", x))
# Learn the vocabulary from the training texts and check its size.
vectorizer.fit(train_text)
len(vectorizer.vocabulary_)
# Transform train data into a document-term matrix.
X_train_dtm = vectorizer.transform(train_text)
X_train_dtm
"""


13572

### Let's classify

Just do the same as before.

In [138]:
"<<CODE>>"

'<<CODE>>'

We can use a better vectorization: the tf-idf transform. It can be tricky for sentiment analysis, though: "dobre" ("good") is basically a stop word, yet it has discriminative power.
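
A small NumPy sketch of the idea, using the plain textbook weighting tf · log(N / df). scikit-learn's `TfidfTransformer` uses a smoothed idf and normalizes rows by default, so its numbers would differ. The toy count matrix and its column meanings are assumptions, and the code assumes every word occurs in at least one document.

```python
import numpy as np

def tfidf(counts):
    """Weight raw counts by log(N / document frequency)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)     # number of documents containing each word
    idf = np.log(n_docs / df)
    return counts * idf

# Toy counts (assumed); columns: "dobre", "chladi", "hlucny".
toy_counts = [[1, 1, 0],
              [1, 0, 1],
              [2, 1, 0]]

weights = tfidf(toy_counts)
print(weights)
```

Because "dobre" occurs in every toy document, its idf is log(1) = 0 and the column is wiped out, which is exactly the risk for sentiment analysis: a frequent word may still carry the signal.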