# Statistical NLP

# Project Description 

 
Classification is one of the most common tasks you will encounter in practice.
Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to infer information about a writer from the text alone, without knowing anything else about them.

We are going to create a classifier that predicts multiple features of the author of a given text.
We have designed it as a Multilabel classification problem.


# Dataset 


Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs (ages 23-27),
2994 "30s" blogs (ages 33-47).

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at


<img src="1.jpg">

# Approach & Steps 

1.	Load the dataset (5 points)

a.	Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

2.	Preprocess rows of the “text” column (7.5 points)

a.	Remove unwanted characters

b.	Convert text to lowercase

c.	Remove unwanted spaces

d.	Remove stopwords

3.	As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

a.	Label columns to merge: “gender”, “age”, “topic”, “sign”
b.	After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image

4.	Separate features and labels, and split the data into training and testing (5 points)

5.	Vectorize the features (5 points)

a.	Create a Bag of Words using count vectorizer
i.	Use ngram_range=(1, 2)
ii.	Vectorize training and testing features

b.	Print the term-document matrix

6.	Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)

7. Transform the labels - (7.5 points)
As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that the prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
a.	Convert your train and test labels using MultiLabelBinarizer

8.	 Choose a classifier - (5 points)

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

a.	Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

b.	As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that

9.	Fit the classifier, make predictions and get the accuracy (5 points)

a.	Print the following

i.	Accuracy score
ii.	F1 score
iii.	Average precision score
iv.	Average recall score
v.	Tip: Make sure you are familiar with all of them. How would you expect the things to work for the multi-label scenario? Read about micro/macro/weighted averaging

10.	 Print true label and predicted label for any five examples (7.5 points)


# 1. Load the dataset (5 points)

a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
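
One way to follow this tip is to limit the number of rows while reading the file rather than after loading all of it. The sketch below uses a tiny in-memory CSV as a stand-in for blogtext.csv (the rows are made up; only the column names match the real file) and the `nrows` argument of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for blogtext.csv (hypothetical rows, same columns).
csv_text = """id,gender,age,topic,sign,date,text
1,male,15,Student,Leo,"14,May,2004",first post
1,male,15,Student,Leo,"13,May,2004",second post
2,female,24,indUnk,Scorpio,"10,June,2004",third post
"""

# nrows stops parsing after the requested number of data rows,
# so the full file never has to be held in memory.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample.shape)
```

With the real file, the same call would take the file path in place of the `StringIO` object.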

In [1]:
import numpy as np
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
import nltk
import spacy
from nltk.corpus import stopwords,wordnet
from nltk.tokenize import word_tokenize, RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import string
import re
from sklearn.feature_extraction import text 
%matplotlib inline 

In [2]:
data = pd.read_csv("blogtext.csv")

In [3]:
data.columns


Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [4]:
data.shape

(681284, 7)

In [5]:
data['text']

0                    Info has been found (+/- 100 pages,...
1                    These are the team members:   Drewe...
2                    In het kader van kernfusie op aarde...
3                          testing!!!  testing!!!          
4                      Thanks to Yahoo!'s Toolbar I can ...
5                      I had an interesting conversation...
6                      Somehow Coca-Cola has a way of su...
7                      If anything, Korea is a country o...
8                      Take a read of this news article ...
9                      I surf the English news sites a l...
10                     Ah, the Korean language...it look...
11                     If you click on my profile you'll...
12                     Last night was pretty fun...mostl...
13                     There is so much that is differen...
14                      urlLink    Here it is, the super...
15                     One thing I love about Seoul (and...
16                      urlLink    Wonde

In [6]:
new_data = data.head(1000).copy()  # copy, so that adding a column does not write to a slice of data

In [7]:
# Join the four label columns into one comma-separated string per row
new_data['label'] = new_data[['gender', 'age', 'topic', 'sign']].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)


In [8]:
new_data

| | id | gender | age | topic | sign | date | text | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... | male,15,Student,Leo |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... | male,15,Student,Leo |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... | male,15,Student,Leo |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! | male,15,Student,Leo |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... | male,33,InvestmentBanking,Aquarius |
| 5 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | I had an interesting conversation... | male,33,InvestmentBanking,Aquarius |
| 6 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | Somehow Coca-Cola has a way of su... | male,33,InvestmentBanking,Aquarius |
| 7 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | If anything, Korea is a country o... | male,33,InvestmentBanking,Aquarius |
| 8 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | Take a read of this news article ... | male,33,InvestmentBanking,Aquarius |
| 9 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 09,June,2004 | I surf the English news sites a l... | male,33,InvestmentBanking,Aquarius |


# 2. Preprocess rows of the “text” column (7.5 points)

a. Remove unwanted characters

b. Convert text to lowercase

c. Remove unwanted spaces

d. Remove stopwords
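
The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the preprocessing used later in this notebook: the function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for the example, and a real run would use a fuller list such as the NLTK or spaCy one.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "i",
              "has", "been", "have", "now"}

def clean_text(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)           # a. drop unwanted characters
    text = text.lower()                                    # b. convert to lowercase
    tokens = text.split()                                  # c. split() also collapses extra spaces
    tokens = [t for t in tokens if t not in STOP_WORDS]    # d. remove stop words
    return " ".join(tokens)

print(clean_text("   Info has been found (+/- 100 pages) Now I have to wait!  "))
# prints: info found 100 pages wait
```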

# 3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

a. Label columns to merge: “gender”, “age”, “topic”, “sign” 


b. After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image
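
As an illustration of the target shape, the sketch below merges the four label columns of a toy data frame (made-up rows) into a single "labels" column holding a list per row, and then keeps only "text" and "labels". Storing a list rather than a joined string is a choice made for this example; the cells in this notebook join the values with commas instead.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Scorpio"],
    "text": ["first post", "second post"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Build one list of string labels per row; str() lets the numeric age
# sit alongside the other labels.
df["labels"] = pd.Series(
    [[str(v) for v in row] for row in df[label_cols].itertuples(index=False)],
    index=df.index,
)
df = df[["text", "labels"]]
print(df)
```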



In [9]:
new_data = new_data.drop(columns=['gender','age','topic','sign','id','date'])

In [10]:
new_data

| | text | label |
|---|---|---|
| 0 | Info has been found (+/- 100 pages,... | male,15,Student,Leo |
| 1 | These are the team members: Drewe... | male,15,Student,Leo |
| 2 | In het kader van kernfusie op aarde... | male,15,Student,Leo |
| 3 | testing!!! testing!!! | male,15,Student,Leo |
| 4 | Thanks to Yahoo!'s Toolbar I can ... | male,33,InvestmentBanking,Aquarius |
| 5 | I had an interesting conversation... | male,33,InvestmentBanking,Aquarius |
| 6 | Somehow Coca-Cola has a way of su... | male,33,InvestmentBanking,Aquarius |
| 7 | If anything, Korea is a country o... | male,33,InvestmentBanking,Aquarius |
| 8 | Take a read of this news article ... | male,33,InvestmentBanking,Aquarius |
| 9 | I surf the English news sites a l... | male,33,InvestmentBanking,Aquarius |


In [11]:
print(len(new_data.text))

1000


In [12]:
new_data.text[0]

'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

In [13]:
nlp = spacy.load('en_core_web_sm')
l1 = ('btw','zza','zzzexy','zzzzz','youuuuu')
# update() adds each word separately; add(l1) would insert the whole tuple as a single entry
nlp.Defaults.stop_words.update(l1)


In [14]:
#for i in range(len(new_data.text)):
 #   new_data.text[i] = new_data.text[i].lower()

In [15]:
new_data.text

0                 Info has been found (+/- 100 pages,...
1                 These are the team members:   Drewe...
2                 In het kader van kernfusie op aarde...
3                       testing!!!  testing!!!          
4                   Thanks to Yahoo!'s Toolbar I can ...
5                   I had an interesting conversation...
6                   Somehow Coca-Cola has a way of su...
7                   If anything, Korea is a country o...
8                   Take a read of this news article ...
9                   I surf the English news sites a l...
10                  Ah, the Korean language...it look...
11                  If you click on my profile you'll...
12                  Last night was pretty fun...mostl...
13                  There is so much that is differen...
14                   urlLink    Here it is, the super...
15                  One thing I love about Seoul (and...
16                   urlLink    Wonderful oh-gyup-sal...
17                  Here is the

In [16]:
new_data.dtypes

text     object
label    object
dtype: object

In [17]:
# Build the tokenizer once instead of on every iteration
tokenizer = RegexpTokenizer(r'\w+')
# spaCy's stop-word set (nlp.Defaults.stop_words)
stop_words = nlp.Defaults.stop_words

def preprocess(text):
    # Lowercase and keep only \w+ tokens, which drops punctuation and extra spaces
    tokens = tokenizer.tokenize(text.lower())
    # Drop stop words; replace any remaining non-alphanumeric characters (e.g. "_") with a space
    return " ".join(re.sub(r"[^a-zA-Z0-9]+", ' ', w) for w in tokens if w not in stop_words)

# Assign the whole column at once instead of element by element
new_data['text'] = new_data['text'].apply(preprocess)


In [18]:
new_data.text

0      info found 100 pages 4 5 mb pdf files wait unt...
1      team members drewes van der laag urllink mail ...
2      het kader van kernfusie op aarde maak je eigen...
3                                        testing testing
4      thanks yahoo s toolbar capture urls popups mea...
5      interesting conversation dad morning talking k...
6      coca cola way summing things early 1970s flags...
7      korea country extremes fad based think come ko...
8      read news article urllink joongang ilbo north ...
9      surf english news sites lot looking tidbits ko...
10     ah korean language looks difficult figure read...
11     click profile ll startling discovery born year...
12     night pretty fun company kept recently met cou...
13     different ve seen haven t travelled canada phi...
14     urllink superfantastic phonebox today great da...
15     thing love seoul mean korea general happen lit...
16     urllink wonderful oh gyup sal favorite pork re...
17     latest korean rumor mill

# 4. Separate features and labels, and split the data into training and testing (5 points)

In [19]:
X = new_data.text
Y = new_data.label

In [20]:

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=524)

# 5. Vectorize the features (5 points)

a. Create a Bag of Words using count vectorizer i. Use ngram_range=(1, 2) ii. Vectorize training and testing features

b. Print the term-document matrix

### Bag of Words

Bag of Words (BoW) is a method for extracting features from text documents; these features can then be used to train machine learning algorithms. It builds a vocabulary of all the unique words occurring in the documents of the training set and represents each document by the counts of those words.

### Term-document matrix

A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.
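
A small sketch of both ideas on two toy training sentences (made up for this example): `CountVectorizer` with `ngram_range=(1, 2)` builds the vocabulary of unigrams and bigrams, and the resulting sparse matrix is printed as a document-term table. The column labels are taken from the fitted `vocabulary_` attribute so that the snippet does not depend on which feature-name method a given scikit-learn version provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["korean food is great", "great food great day"]
test_docs = ["korean day"]

vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(train_docs)   # learn the vocabulary on training text only
X_test_dtm = vect.transform(test_docs)         # reuse that vocabulary for the test text

# vocabulary_ maps each term to its column index; sort by index to label the columns.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
dtm = pd.DataFrame(X_train_dtm.toarray(), columns=terms)
print(dtm)
```

Rows are documents and columns are terms. Bigrams that appear only in the test text (here "korean day") are ignored, because the vocabulary is fixed at fit time.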

In [21]:
# Fit a vectorizer on the training text, then report train/test accuracy
# for Multinomial Naive Bayes and Logistic Regression.
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)

    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_train_pred = nb.predict(X_train_dtm)
    y_pred_class = nb.predict(X_test_dtm)
    print('Train Accuracy for NB : ', metrics.accuracy_score(y_train, y_train_pred))
    print('Test Accuracy for NB: ', metrics.accuracy_score(y_test, y_pred_class))

    # A very large C makes the L2 penalty negligible (almost no regularization)
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train_dtm, y_train)
    y_train_LR = logreg.predict(X_train_dtm)
    y_pred_class_LR = logreg.predict(X_test_dtm)
    print('Train Accuracy for LR: ', metrics.accuracy_score(y_train, y_train_LR))
    print('Test Accuracy for LR: ', metrics.accuracy_score(y_test, y_pred_class_LR))


In [22]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  95246
Train Accuracy for NB :  0.9733333333333334
Test Accuracy for NB:  0.584
Train Accuracy for LR:  0.9986666666666667
Test Accuracy for LR:  0.628


In [23]:
# Refit at the top level so that X_train_dtm and X_test_dtm are available to later cells
X_train_dtm = vect.fit_transform(X_train)
print('Features: ', X_train_dtm.shape[1])
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_train_pred = nb.predict(X_train_dtm)
y_pred_class = nb.predict(X_test_dtm)
print('Train Accuracy for NB : ', metrics.accuracy_score(y_train,y_train_pred))
print('Test Accuracy for NB: ', metrics.accuracy_score(y_test, y_pred_class))

Features:  95246
Train Accuracy for NB :  0.9733333333333334
Test Accuracy for NB:  0.584


In [24]:
# feature names (on scikit-learn >= 1.0 use vect.get_feature_names_out(); get_feature_names() was removed in 1.2)
feature_names = vect.get_feature_names()
print(feature_names[50:500])

['06', '06 pm', '07', '07 41', '07 pm', '09', '09 pm', '10', '10 000', '10 11', '10 13', '10 15', '10 16', '10 20', '10 30', '10 30am', '10 45', '10 45am', '10 best', '10 clock', '10 collect', '10 comeing', '10 cut', '10 days', '10 didn', '10 difficulty', '10 dirty', '10 eats', '10 evil', '10 fancy', '10 flight', '10 freakin', '10 gig', '10 golf', '10 good', '10 government', '10 ground', '10 hole', '10 hour', '10 hr', '10 inches', '10 items', '10 koreans', '10 leave', '10 liter', '10 middle', '10 midnight', '10 mins', '10 minutes', '10 months', '10 movies', '10 muahahahaha', '10 murders', '10 mysteries', '10 nbsp', '10 night', '10 people', '10 percent', '10 pm', '10 poor', '10 pounds', '10 pple', '10 read', '10 st', '10 stop', '10 straight', '10 think', '10 thursday', '10 till', '10 times', '10 white', '10 wondering', '10 wont', '10 words', '10 years', '10 yes', '10 yr', '100', '100 000', '100 days', '100 denver', '100 facts', '100 grams', '100 hours', '100 kinds', '100 left', '100 mar

In [25]:
y_train.head(4)

334        female,24,indUnk,Scorpio
531    female,27,Education,Aquarius
447        female,24,indUnk,Scorpio
193       female,37,indUnk,Aquarius
Name: label, dtype: object

In [26]:
len(y_train)

750

# 6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)

In [27]:
d = {}

In [28]:
d = y_train.apply(lambda x : pd.value_counts(x.split(","))).sum(axis = 0).to_dict()

In [29]:
d

{'24': 248.0,
 'indUnk': 273.0,
 'female': 316.0,
 'Scorpio': 179.0,
 'Education': 59.0,
 'Aquarius': 170.0,
 '27': 59.0,
 '37': 18.0,
 'Non-Profit': 36.0,
 '25': 43.0,
 'male': 434.0,
 'Cancer': 56.0,
 '15': 59.0,
 'Student': 124.0,
 'InvestmentBanking': 55.0,
 '33': 64.0,
 'Engineering': 92.0,
 'Libra': 136.0,
 '14': 60.0,
 'Aries': 38.0,
 '17': 104.0,
 'Science': 21.0,
 '41': 9.0,
 'Communications-Media': 9.0,
 'Leo': 26.0,
 'Sagittarius': 70.0,
 '23': 50.0,
 '26': 22.0,
 'Banking': 9.0,
 'Capricorn': 55.0,
 'Sports-Recreation': 54.0,
 'Gemini': 13.0,
 'BusinessServices': 17.0,
 '45': 10.0,
 'Taurus': 5.0,
 'Arts': 1.0,
 'Pisces': 1.0,
 'Virgo': 1.0,
 '34': 3.0,
 '44': 1.0}
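
An alternative sketch of the same count using only the standard library: `collections.Counter` yields integer counts and avoids building a temporary Series per row. The three label strings below are made up for the example.

```python
from collections import Counter

labels = [
    "male,15,Student,Leo",
    "male,15,Student,Leo",
    "female,24,indUnk,Scorpio",
]

label_counts = Counter()
for row in labels:
    # Each row contributes one count to every label it contains.
    label_counts.update(row.split(","))

print(dict(label_counts))
```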

# 7. Transform the labels - (7.5 points)

As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that the prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

a. Convert your train and test labels using MultiLabelBinarizer

### MultiLabelBinarizer

MultiLabelBinarizer allows you to encode multiple labels per instance. To interpret the resulting array, you can build a DataFrame from it, using the encoded classes (available through the "classes_" attribute) as column names: binarizer = MultiLabelBinarizer(), followed by pd.DataFrame(binarizer.fit_transform(y), columns=binarizer.classes_).

In [30]:
from sklearn.preprocessing import MultiLabelBinarizer

In [31]:
mlb = MultiLabelBinarizer()

In [32]:
y_train_mlb = mlb.fit_transform(y_train)

In [33]:
y_test_mlb = mlb.transform(y_test)
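
One thing to watch: MultiLabelBinarizer expects each sample to be an iterable of labels. When a sample is a comma-joined string, like the values in y_train, the string itself is iterated, so every character (including the comma) becomes a class. The sketch below uses two made-up label strings to show the difference, and a variant that splits the strings first so that each label gets its own column.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

y = pd.Series(["male,15,Student,Leo", "female,24,indUnk,Scorpio"])

# Raw strings: each string is iterated character by character.
char_mlb = MultiLabelBinarizer().fit(y)
print(char_mlb.classes_[:5])   # single characters, starting with ','

# Splitting first gives one column per label.
label_mlb = MultiLabelBinarizer()
y_bin = label_mlb.fit_transform(y.str.split(","))
print(label_mlb.classes_)
print(y_bin)
```

With the split variant, the columns of the binarized matrix line up with the label names counted in step 6.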

# 8. Choose a classifier - (5 points)

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that

### One-vs-the-rest (OvR) multiclass/multilabel strategy

After fitting, OneVsRestClassifier exposes two attributes that are relevant here: label_binarizer_, the object used to transform multiclass labels to binary labels and vice versa, and multilabel_, a boolean that indicates whether the classifier is a multilabel classifier.

Also known as one-vs-all, this strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes. In addition to its computational efficiency (only n_classes classifiers are needed), one advantage of this approach is its interpretability. Since each class is represented by one and only one classifier, it is possible to gain knowledge about the class by inspecting its corresponding classifier. This is the most commonly used strategy for multiclass classification and is a fair default choice.

This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels for each instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.

In the multilabel learning literature, OvR is also known as the binary relevance method.
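
A minimal sketch of the multilabel case on toy data (the feature values and the indicator matrix are made up): fitting OneVsRestClassifier on a 2-d indicator matrix trains one binary LogisticRegression per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy features and a 2-d indicator matrix: cell [i, j] is 1 if sample i has label j.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
Y = np.array([[0, 1], [1, 0], [1, 1],
              [0, 0], [1, 0], [0, 1]])

clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
clf.fit(X, Y)

print(len(clf.estimators_))   # one fitted binary classifier per label column
print(clf.multilabel_)        # True, because Y is an indicator matrix
print(clf.predict(X).shape)   # predictions have the same layout as Y
```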

In [50]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [35]:
LR = LogisticRegression(solver = 'lbfgs',random_state= 111)

In [36]:
LR

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=111, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [37]:
clf = OneVsRestClassifier(LR)

In [38]:
names = vect.get_feature_names()

In [39]:
y_train_mlb

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 1, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 1, 1, 0],
       [1, 0, 0, ..., 1, 1, 0],
       [1, 0, 0, ..., 0, 0, 0]])

# 9. Fit the classifier, make predictions and get the accuracy (5 points)
a. Print the following

i. Accuracy score 

ii. F1 score 

iii. Average precision score 

iv. Average recall score 

v. Tip: Make sure you are familiar with all of them. How would you expect the things to work for the multi-label scenario? Read about micro/macro/weighted averaging
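
To make the tip concrete, here is a small sketch on a made-up indicator matrix. In the multilabel setting, `accuracy_score` is subset accuracy (a sample counts only if its whole row matches), while precision, recall and F1 depend on the averaging mode. The `zero_division` argument is assumed to be available (scikit-learn 0.22 or later); it silences the warning for labels that are never predicted.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Subset accuracy: only the second row matches exactly.
print("accuracy", accuracy_score(y_true, y_pred))

# micro: pool all label decisions; macro: unweighted mean over labels;
# weighted: mean over labels weighted by each label's support.
for avg in ("micro", "macro", "weighted"):
    print(avg,
          precision_score(y_true, y_pred, average=avg, zero_division=0),
          recall_score(y_true, y_pred, average=avg, zero_division=0),
          f1_score(y_true, y_pred, average=avg, zero_division=0))
```

Micro averaging is dominated by frequent labels, whereas macro averaging gives rare labels the same weight as common ones, which matters for a label distribution as skewed as the one in step 6.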

In [40]:
clf.fit(X_train_dtm,y_train_mlb)


In [41]:
y_pred_clf = clf.predict(X_test_dtm)

### i. Accuracy score

In [42]:
print(metrics.accuracy_score(y_test_mlb,y_pred_clf))

0.196


### ii. F1 score

In [43]:
from sklearn.metrics import classification_report

In [44]:
print(classification_report(y_test_mlb, y_pred_clf))



              precision    recall  f1-score   support

           0       1.00      1.00      1.00       250
           1       0.82      0.39      0.53        36
           2       0.95      0.56      0.70        68
           3       0.87      0.83      0.85       155
           4       0.93      0.31      0.46        42
           5       0.85      0.58      0.69       105
           6       0.83      0.15      0.26        33
           7       0.00      0.00      0.00        13
           8       0.86      0.37      0.52        65
           9       0.97      0.46      0.62        61
          10       1.00      0.43      0.61        23
          11       0.93      0.34      0.50        41
          12       0.92      0.45      0.61        53
          13       0.00      0.00      0.00         2
          14       1.00      0.53      0.70        15
          15       0.96      0.41      0.57        56
          16       0.00      0.00      0.00         5
          17       0.00    

### iii. Average precision score

In [45]:
metrics.average_precision_score(y_test_mlb, y_pred_clf,average='micro')

0.8155693226832911
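
Note that `average_precision_score` summarizes a precision-recall curve and is designed for continuous scores (for example from `decision_function` or `predict_proba`), whereas `precision_score` works on hard 0/1 predictions. The sketch below contrasts the two on made-up scores; the 0.5 threshold is an assumption for the example.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score

y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
# Hypothetical per-label scores, e.g. from clf.decision_function or clf.predict_proba.
y_score = np.array([[0.9, 0.2], [0.4, 0.8], [0.7, 0.3], [0.1, 0.6]])
y_pred = (y_score >= 0.5).astype(int)   # hard predictions at a 0.5 threshold

# Precision of the thresholded predictions (3 true positives, 1 false positive).
print(precision_score(y_true, y_pred, average="micro"))
# Average precision over all thresholds, computed from the raw scores.
print(average_precision_score(y_true, y_score, average="micro"))
```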

In [46]:
#metrics.average_precision_score(y_test_mlb, y_pred_clf,average='weighted')

### iv. Average recall score

In [47]:
# Micro-averaged recall over all label columns
metrics.recall_score(y_test_mlb, y_pred_clf, average='micro')

0.7673563218390804

# 10. Print true label and predicted label for any five examples (7.5 points)

In [48]:
y_pred_clf[10:15]

array([[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0],
       [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]])

In [49]:
y_test_mlb[10:15]

array([[1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
        0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0],
       [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]])
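
The binary masks are hard to read on their own. When the binarizer has been fitted on lists of labels, its `inverse_transform` method maps each row back to label names. A minimal sketch on made-up labels and made-up predictions:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_true_bin = mlb.fit_transform([
    ["male", "15", "Student", "Leo"],
    ["female", "24", "indUnk", "Scorpio"],
])

# Hypothetical predictions: the first row is exact, the second misses "indUnk".
y_pred_bin = y_true_bin.copy()
y_pred_bin[1, list(mlb.classes_).index("indUnk")] = 0

# inverse_transform returns one tuple of label names per row.
for true, pred in zip(mlb.inverse_transform(y_true_bin),
                      mlb.inverse_transform(y_pred_bin)):
    print("true:", true, "| predicted:", pred)
```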