REFERENCE: https://www.kaggle.com/rtatman/blog-authorship-corpus


# Context:
“A blog (a truncation of the expression "weblog") is a discussion or informational website published on the World Wide Web consisting of discrete, often informal diary-style text entries ("posts"). Posts are typically displayed in reverse chronological order, so that the most recent post appears first, at the top of the web page. Until 2009, blogs were usually the work of a single individual, occasionally of a small group, and often covered a single subject or topic.” -- Wikipedia article “Blog”

This dataset contains text from blogs written on or before 2004, with each blog being the work of a single user.


# Content:
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group there are an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.


# Acknowledgements
The corpus may be freely used for non-commercial research purposes. Any resulting publications should cite the following:

J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging in Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs. URL: http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf


# Inspiration:
- This dataset contains information on writers' demographics, including their age, gender and zodiac sign. Can you build a classifier to guess someone’s zodiac sign from blog posts they’ve written?
- Which are bigger: differences between demographic groups or differences between blogs on different topics?

---

# Approach and steps
1. Load the dataset (5 points)
    - Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
2. Preprocess rows of the “text” column (7.5 points)
    - Remove unwanted characters
    - Convert text to lowercase
    - Remove unwanted spaces
    - Remove stopwords
3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
    - Label columns to merge: “gender”, “age”, “topic”, “sign”
    - After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image
4. Separate features and labels, and split the data into training and testing (5 points)
5. Vectorize the features (5 points)
    - Create a Bag of Words using count vectorizer
        - Use ngram_range=(1, 2)
        - Vectorize training and testing features
    - Print the term-document matrix
6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)
7. Transform the labels - (7.5 points)
    As noted above, each example in this task can have multiple tags. To handle this kind of prediction, the labels must be transformed into binary form, so that each prediction is a mask of 0s and 1s. MultiLabelBinarizer from sklearn is convenient for this purpose.
    - Convert your train and test labels using MultiLabelBinarizer
8. Choose a classifier - (5 points)
    In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.
    - Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
    - As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that
9. Fit the classifier, make predictions and get the accuracy (5 points)
    - Print the following
        - Accuracy score
        - F1 score
        - Average precision score
        - Average recall score
        - Tip: Make sure you are familiar with all of them. How would you expect the things to work for the multi-label scenario? Read about micro/macro/weighted averaging
10. Print true label and predicted label for any five examples (7.5 points)

---

# SOLUTIONING

# STEP 1
Load the dataset (5 points)
- Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

In [1]:
import pandas as pd
import numpy as np

from tqdm import tqdm
import timeit

In [3]:
df = pd.read_csv('blogtext.csv')
df.head()
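
The tip about using fewer rows can be sketched with the `nrows` parameter of `pd.read_csv`, which stops reading after the given number of rows. This is a minimal sketch: the in-memory CSV below is a hypothetical stand-in for `blogtext.csv` (only the column names are taken from the outputs later in this notebook), and the limit of 3 rows is arbitrary.

```python
import io
import pandas as pd

# Hypothetical stand-in for blogtext.csv; the rows are made up for illustration.
csv_text = '''id,gender,age,topic,sign,date,text
1,male,15,Student,Leo,"14,May,2004",first post
1,male,15,Student,Leo,"13,May,2004",second post
2,female,24,Arts,Libra,"11,June,2004",third post
2,female,24,Arts,Libra,"10,June,2004",fourth post
3,male,33,indUnk,Aries,"09,June,2004",fifth post
'''

# nrows limits how many data rows are parsed, which keeps memory use low.
df_small = pd.read_csv(io.StringIO(csv_text), nrows=3)
print(df_small.shape)
```

With the real file, the same call would take the file path in place of the `StringIO` object; the right value of `nrows` depends on the machine.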

# STEP 2
Preprocess rows of the “text” column (7.5 points)
- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [None]:
"""Drop rows with null values"""
print("shape before drop: ", df.shape)
df = df.dropna()  # dropna() returns a new frame, so assign the result back
print("shape after drop: ", df.shape)

In [None]:
"""
Apply
- lowercase to each string column (using .lower)
- strip any extra leading or trailing spaces (using .strip)
"""
for columnLabel in df.columns:
    # Non-string values (numbers, NaN) are passed through unchanged
    df[columnLabel] = df[columnLabel].apply(lambda x: x.lower().strip() if isinstance(x, str) else x)

#df.to_csv('data_lower_striped.csv', index=False)   # Commented out to avoid rewriting the file on every run

df.head()

# Read from lower_striped file

In [None]:
import pandas as pd
import numpy as np

from tqdm import tqdm

# df = pd.read_csv('data_lower_striped.csv')
# df.head()

In [None]:
df.info()

In [None]:
"""Keep only letters, digits and the characters # + _ . in the text"""

import re
from nltk.corpus import stopwords

# Searching a set is much faster than searching a list, so build the
# stop-word set once instead of on every call.
# Requires the NLTK stop-word corpus: nltk.download('stopwords')
stops = set(stopwords.words("english"))

def review_to_words(raw_review):
    # Convert one raw blog post (a string) into a string of cleaned words.
    #
    # 1. Replace every character other than letters, digits and # + _ . with a space
    letters_only = re.sub(r"[^a-zA-Z0-9#+_.]", " ", str(raw_review))
    #
    # 2. Convert to lower case and split into individual words
    words = letters_only.lower().split()
    #
    # 3. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 4. Join the words back into one string separated by spaces
    return " ".join(meaningful_words)

testString = "This is a test 1.1 string testing. 123, # and ? and ~ and _ and + and others like @ ! $ % ^ * = *** ??? or @@@"
print(review_to_words(testString))

In [None]:
# # Initialize an empty list to hold the clean reviews
# #clean_train_reviews = []

# for i in tqdm(df.index):
#     # Call function for each one, and add the result to the list of clean reviews
#     inputDoc = df['text'][i]
#     #clean_train_reviews.append(review_to_words(inputDoc))
#     df['text'][i] = review_to_words(inputDoc)

# df.to_csv('data_lower_striped_onlyNumAlpha_noStopWords.csv', index=False)

# df.head()
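
The commented-out loop above writes back with `df['text'][i] = ...`, which is chained assignment and may not update the frame reliably. A minimal alternative sketch uses `Series.apply` on the whole column. The function name `clean_text`, the tiny `STOPS` set and the two demo rows are assumptions made for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import re
import pandas as pd

# Hypothetical tiny stop-word list, used here only to keep the sketch self-contained.
STOPS = {"this", "is", "a", "and", "the", "of"}

def clean_text(raw):
    # Keep letters, digits and the characters # + _ . ; everything else becomes a space.
    kept = re.sub(r"[^a-zA-Z0-9#+_.]", " ", str(raw))
    words = kept.lower().split()
    return " ".join(w for w in words if w not in STOPS)

demo = pd.DataFrame({"text": ["This is a TEST, of C++ and C#!", "  The   end.  "]})

# apply() runs the function on every row and returns a new column in one step.
demo["text"] = demo["text"].apply(clean_text)
print(demo["text"].tolist())
```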

# Read and process from data_lower_striped_onlyNumAlpha_noStopWords.csv

In [53]:
import pandas as pd
import numpy as np

from tqdm import tqdm

df = pd.read_csv('data_lower_striped_onlyNumAlpha_noStopWords.csv')
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,student,leo,"14,may,2004",info found + 100 pages 4.5 mb .pdf files wait ...
1,2059027,male,15,student,leo,"13,may,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,student,leo,"12,may,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,student,leo,"12,may,2004",testing testing
4,3581210,male,33,investmentbanking,aquarius,"11,june,2004",thanks yahoo toolbar capture urls popups...whi...


In [57]:
print(df.gender.nunique())
print(df.age.nunique())
print(df.topic.nunique())
print(df.sign.nunique())

2
26
40
12


# STEP 3

- As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
    - Label columns to merge: “gender”, “age”, “topic”, “sign”
    - After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image

In [58]:
df = df.assign(temp_labels = df.gender + ', ' + df.age.astype(str) + ', ' + df.topic + ', ' + df.sign)

df.drop(['id', 'gender', 'age', 'topic', 'sign', 'date'], axis=1, inplace=True)
df.head()

# df.to_csv('data_lower_striped_onlyNumAlpha_noStopWords_ExtraColumnsRemoved.csv', index=False)

Unnamed: 0,text,temp_labels
0,info found + 100 pages 4.5 mb .pdf files wait ...,"male, 15, student, leo"
1,team members drewes van der laag urllink mail ...,"male, 15, student, leo"
2,het kader van kernfusie op aarde maak je eigen...,"male, 15, student, leo"
3,testing testing,"male, 15, student, leo"
4,thanks yahoo toolbar capture urls popups...whi...,"male, 33, investmentbanking, aquarius"
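
The merge above joins the label columns into one string and splits it again in STEP 4. As a sketch of an alternative, the labels can be collected into a list per row directly. The frame `demo` below is a hypothetical two-row example that only reuses the notebook's column names.

```python
import pandas as pd

demo = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["student", "arts"],
    "sign": ["leo", "libra"],
    "text": ["testing testing", "hello world"],
})
label_cols = ["gender", "age", "topic", "sign"]

# Cast to str so that ages become labels such as '15', then build one list per row.
rows = demo[label_cols].astype(str).values.tolist()
demo["labels"] = pd.Series(rows, index=demo.index)

# Keep only the two columns required by the task.
demo = demo[["text", "labels"]]
print(demo)
```

Note that a column of Python lists does not survive a round trip through CSV as lists, which is one reason to keep the string form when saving intermediate files.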


In [63]:
# #
# import collections
# my_dict = collections.defaultdict(int)

# for i in tqdm(df.index):
#     for lab in df.temp_labels[i].strip().split():
#         my_dict[lab] += 1

# # Print the dictionary
# my_dict

# STEP 4

Separate features and labels, and split the data into training and testing (5 points)

In [1]:
import pandas as pd
import numpy as np

from tqdm import tqdm
import timeit


df = pd.read_csv('data_lower_striped_onlyNumAlpha_noStopWords_ExtraColumnsRemoved.csv')
df['labels'] = df.temp_labels.apply(lambda x: x.split(', '))
df.drop(['temp_labels'], axis=1, inplace=True)
df.head()

Unnamed: 0,text,labels
0,info found + 100 pages 4.5 mb .pdf files wait ...,"[male, 15, student, leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, student, leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, student, leo]"
3,testing testing,"[male, 15, student, leo]"
4,thanks yahoo toolbar capture urls popups...whi...,"[male, 33, investmentbanking, aquarius]"


In [2]:
type(df.labels[0])

list

In [3]:
X = df.text
y = df.labels

from sklearn.model_selection import train_test_split

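exclude = 0.80  # fraction of the data left out to keep the run time manageable

# Set aside `exclude` of the data; it is not used at all
X, X_notUsed, y, y_notUsed = train_test_split(X, y, test_size=exclude)

# 20% of the kept data is for testing
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2)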

# 30% of remaining is for validation
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.3)

print('training records count: ', X_train.shape)
print('validation records count: ', X_val.shape)
print('testing records count: ', X_test.shape)

training records count:  (76302,)
validation records count:  (32702,)
testing records count:  (27252,)
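
The splits above are random, so the record counts stay the same between runs but the rows in each split do not. A minimal sketch of a reproducible version passes `random_state` to `train_test_split`. The toy data and the seed value 42 are assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 posts with two labels each.
X_demo = [f"post {i}" for i in range(100)]
y_demo = [["male", "15"] if i % 2 else ["female", "24"] for i in range(100)]

# 20% for testing, then 30% of the remainder for validation, as in the cell above.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=42)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```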


# STEP 5

- Vectorize the features (5 points)
    - Create a Bag of Words using count vectorizer
        - Use ngram_range=(1, 2)
        - Vectorize training and testing features
    - Print the term-document matrix

In [4]:
# 6 minutes for 2000 features

start = timeit.default_timer()

from sklearn.feature_extraction.text import CountVectorizer

vectors = 5000

# Initialize the "CountVectorizer" object, which is scikit-learn's 
# bag of words tool.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=vectors,
                             ngram_range=(1, 2))

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
X_train_vectorized = vectorizer.fit_transform(X_train.astype('U'))

# Numpy arrays are easy to work with, so convert the result to an 
# array
X_train_vectorized = X_train_vectorized.toarray()

stop = timeit.default_timer()
print('Time taken: ', stop - start)

print(X_train_vectorized.shape)
# (381518, 486259)   # Full train data set for ngram_range=(1,1)
# (381518, 14735264) # Full train data set for ngram_range=(1,2)
# (381518, 46840509) # Full train data set for ngram_range=(2,3)

Time taken:  76.8528673
(76302, 5000)


In [5]:
"""
Print the term document matrix
"""

'\nPrint the term document matrix\n'

In [6]:
# 2 minutes for 2000 features

start = timeit.default_timer()


# Get a bag of words for the validation set, and convert to a numpy array
X_val_vectorized = vectorizer.transform(X_val.astype('U'))
X_val_vectorized = X_val_vectorized.toarray()

stop = timeit.default_timer()
print('Time taken: ', stop - start)

Time taken:  14.019949400000002


# STEP 6
Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)

In [7]:
import collections
label_dict = collections.defaultdict(int)

# Keys are (position, label) tuples, where position is 0 = gender, 1 = age,
# 2 = topic, 3 = sign. The position is reused later to group predictions.
for i in tqdm(df.index):
    for lab in enumerate(df.labels[i]):
        label_dict[lab] += 1

# Print the dictionary
for key in sorted(label_dict.keys()):
    print(key, ": ", label_dict[key])

100%|███████████████████████████████████████████████████████████████████████| 681284/681284 [00:45<00:00, 14832.47it/s]


(0, 'female') :  336091
(0, 'male') :  345193
(1, '13') :  13133
(1, '14') :  27400
(1, '15') :  41767
(1, '16') :  72708
(1, '17') :  80859
(1, '23') :  72889
(1, '24') :  80071
(1, '25') :  67051
(1, '26') :  55312
(1, '27') :  46124
(1, '33') :  17584
(1, '34') :  21347
(1, '35') :  17462
(1, '36') :  14229
(1, '37') :  9317
(1, '38') :  7545
(1, '39') :  5556
(1, '40') :  5016
(1, '41') :  3738
(1, '42') :  2908
(1, '43') :  4230
(1, '44') :  2044
(1, '45') :  4482
(1, '46') :  2733
(1, '47') :  2207
(1, '48') :  3572
(2, 'accounting') :  3832
(2, 'advertising') :  4676
(2, 'agriculture') :  1235
(2, 'architecture') :  1638
(2, 'arts') :  32449
(2, 'automotive') :  1244
(2, 'banking') :  4049
(2, 'biotech') :  2234
(2, 'businessservices') :  4500
(2, 'chemicals') :  3928
(2, 'communications-media') :  20140
(2, 'construction') :  1093
(2, 'consulting') :  5862
(2, 'education') :  29633
(2, 'engineering') :  11653
(2, 'environment') :  592
(2, 'fashion') :  4851
(2, 'government') : 
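
If the category position is not needed in the key, the same count can be sketched with `collections.Counter`, keyed by label name only as the task describes. The three label lists below are a hypothetical example.

```python
from collections import Counter
from itertools import chain

# Hypothetical label lists standing in for df.labels.
demo_labels = [
    ["male", "15", "student", "leo"],
    ["male", "15", "student", "leo"],
    ["female", "24", "arts", "libra"],
]

# Flatten the lists and count every label in one pass.
label_counts = Counter(chain.from_iterable(demo_labels))
for name, count in sorted(label_counts.items()):
    print(name, ":", count)
```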

In [8]:
# # Take a look at the words in the vocabulary
# # (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
# vocab = vectorizer.get_feature_names_out()
# print(vocab)

# import numpy as np

# # Sum up the counts of each vocabulary word
# dist = np.sum(X_train_vectorized, axis=0)

# # For each, print the vocabulary word and the number of times it 
# # appears in the training set
# i = 0
# for tag, count in zip(vocab, dist):
#     i = i + 1
#     print(i, ": ", tag, count)

# STEP 7
Transform the labels - (7.5 points)

As noted above, each example in this task can have multiple tags. To handle this kind of prediction, the labels must be transformed into binary form, so that each prediction is a mask of 0s and 1s. MultiLabelBinarizer from sklearn is convenient for this purpose.
    - Convert your train and test labels using MultiLabelBinarizer

In [9]:
from sklearn.preprocessing import MultiLabelBinarizer

lb = MultiLabelBinarizer().fit(df.labels)
print(lb.classes_)


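# The binarizer is already fitted on all labels, so only transform here;
# refitting on each split could change the number or order of the columns.
y_train_binarized = lb.transform(y_train)
y_test_binarized = lb.transform(y_test)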

print("\n", y_train_binarized.shape)
print(y_test_binarized.shape)

['13' '14' '15' '16' '17' '23' '24' '25' '26' '27' '33' '34' '35' '36'
 '37' '38' '39' '40' '41' '42' '43' '44' '45' '46' '47' '48' 'accounting'
 'advertising' 'agriculture' 'aquarius' 'architecture' 'aries' 'arts'
 'automotive' 'banking' 'biotech' 'businessservices' 'cancer' 'capricorn'
 'chemicals' 'communications-media' 'construction' 'consulting'
 'education' 'engineering' 'environment' 'fashion' 'female' 'gemini'
 'government' 'humanresources' 'indunk' 'internet' 'investmentbanking'
 'law' 'lawenforcement-security' 'leo' 'libra' 'male' 'manufacturing'
 'maritime' 'marketing' 'military' 'museums-libraries' 'non-profit'
 'pisces' 'publishing' 'realestate' 'religion' 'sagittarius' 'science'
 'scorpio' 'sports-recreation' 'student' 'taurus' 'technology'
 'telecommunications' 'tourism' 'transportation' 'virgo']

 (76302, 80)
(27252, 80)
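
The fit-once, transform-many pattern can be sketched on a toy example. The label lists and the names `mlb`, `y_tr` and `y_te` are assumptions for illustration. Because the binarizer is fitted a single time, every split shares the same column order, and `inverse_transform` maps a mask back to label names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists standing in for y_train and y_test.
train_labels = [["male", "15", "student", "leo"], ["female", "24", "arts", "libra"]]
test_labels = [["female", "15", "arts", "leo"]]

mlb = MultiLabelBinarizer()
mlb.fit(train_labels + test_labels)  # learn the full label set once

y_tr = mlb.transform(train_labels)
y_te = mlb.transform(test_labels)

print(mlb.classes_)
print(y_te)
print(mlb.inverse_transform(y_te))  # back from 0/1 mask to label names
```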


# STEP 8
Choose a classifier - (5 points)
- In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.
        a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
        b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that
            
            # CODE ---START---
            from sklearn.multiclass import OneVsRestClassifier
            from sklearn.linear_model import LogisticRegression

            logReg = LogisticRegression(solver='lbfgs')
            clf = OneVsRestClassifier(logReg)
            # CODE ---COMPLETE---

In [10]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
# import timeit

iterations = 300

logReg = LogisticRegression(solver='lbfgs', max_iter=iterations, verbose=1)
clf = OneVsRestClassifier(logReg)
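
As a small end-to-end sketch of the One-vs-Rest idea, the example below fits one logistic regression per label on a toy corpus. The texts, labels and variable names are made up for illustration. It also passes the sparse matrix from `CountVectorizer` straight to the classifier, which scikit-learn's `LogisticRegression` accepts, so the `toarray()` conversion is not strictly required.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: four posts, two labels each.
texts = ["school homework exam", "exam school teacher",
         "office meeting deadline", "deadline office budget"]
labels = [["student", "male"], ["student", "female"],
          ["banking", "male"], ["banking", "female"]]

X_sparse = CountVectorizer().fit_transform(texts)  # kept sparse on purpose
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_sparse, Y)

# One binary classifier is trained per label column.
print(len(ovr.estimators_), "classifiers for", len(mlb.classes_), "labels")
print(ovr.predict_proba(X_sparse).shape)
```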

# STEP 9
Fit the classifier, make predictions and get the accuracy (5 points)
- Print the following
    - Accuracy score
    - F1 score
    - Average precision score
    - Average recall score
    - Tip: Make sure you are familiar with all of them. How would you expect the things to work for the multi-label scenario? Read about micro/macro/weighted averaging

In [11]:
start = timeit.default_timer()

"""
Fit the classifier
"""
clf.fit(X_train_vectorized, y_train_binarized)

stop = timeit.default_timer()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.4min finished
[... the same pair of lines repeats for the remaining fits, with elapsed times between 42.5s and 3.2min; output truncated ...]

In [12]:
print('Time taken: ', stop - start)
# @500 vectors (10% records)
# 199 for 50
# 324 for 100
# 434 for 150
# 509 for 200

# @1000 vectors (10% records)
# 658 for 200

# @2000 vectors (20% records)
# 2301 for 200

# @10000 vectors (20% records)
# 12966 for 200

# @5000 vectors (20% records)
# 8606 for 300

Time taken:  8606.1222725


In [22]:
"""
Write logic to get label with max probability in each category
"""

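# Map each label to the position of its category (0 = gender, 1 = age, 2 = topic, 3 = sign),
# using the (position, label) keys of label_dict built in STEP 6.
label_label_dict = {label: position for position, label in label_dict}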

"""
Take probabilities for each class as input, and, output classes with max probability in each category.
"""
def topPredictions(arrayOfProb):
    out = ['', '', '', '']
    out_prob = [0, 0, 0, 0]
    for i in range(0, arrayOfProb.shape[0]):
        item_i_prob = arrayOfProb[i] # Probability
        item_i = lb.classes_[i] # Class label
        label_label = label_label_dict[item_i] # Labels label (0, 1, 2, 3)
        
        if item_i_prob > out_prob[label_label]:
            out_prob[label_label] = item_i_prob
            out[label_label] = item_i
            #print(item_i, " of type ", label_label, " has probability: ", item_i_prob)
    return out, out_prob

"""
Take vectorized X as input.
Do prediction using the classifier.
Find the class with max probability in each category.
"""
def predictLabels(X_input_vectorized):
    y_input_predict = clf.predict_proba(X_input_vectorized)
    y_output_predict = []
#     for i in range(0, len(y_input_predict)):
    for i in tqdm(range(0, len(y_input_predict))):
        y_temp_i, y_temp_i_prob = topPredictions(y_input_predict[i])
        y_output_predict.append(y_temp_i)
    
    return y_output_predict

In [23]:
"""
Predict
"""
y_val_predict = predictLabels(X_val_vectorized)

100%|██████████████████████████████████████████████████████████████████████████| 32702/32702 [00:06<00:00, 5134.59it/s]


In [27]:
# """
# Accuracy
# """
# # predictions = clf.predict(x_test)
# score = clf.score(X_val_vectorized, y_val)
# print(score)
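
The scores requested in STEP 9 are not printed above. A minimal sketch of how they could be computed with `sklearn.metrics` on binarized labels follows. The two small arrays are made-up examples, not results from this model. In the multi-label case `accuracy_score` is the subset accuracy (a row counts only if every label matches), and the `average` argument selects micro, macro or weighted averaging.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical binarized labels: 3 examples, 4 label columns.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

subset_acc = accuracy_score(y_true, y_pred)
print("Accuracy (exact match):", subset_acc)

scores = {}
for avg in ("micro", "macro", "weighted"):
    scores[avg] = (
        f1_score(y_true, y_pred, average=avg),
        precision_score(y_true, y_pred, average=avg),
        recall_score(y_true, y_pred, average=avg),
    )
    print(avg, "F1 / precision / recall:", scores[avg])
```

In this notebook `predictLabels` returns lists of label names, so both the predictions and `y_val` would first need to go through the fitted binarizer (for example `lb.transform(...)`) before being passed to these functions.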

# STEP 10
Print true label and predicted label for any five examples (7.5 points)

In [30]:
y_val_predict[0:5]

[['male', '17', 'student', 'taurus'],
 ['male', '16', 'indunk', 'libra'],
 ['female', '48', 'marketing', 'aries'],
 ['male', '35', 'student', 'aquarius'],
 ['female', '23', 'indunk', 'taurus']]

In [39]:
print(y_val[0:5])

426634            [female, 24, arts, libra]
538641    [female, 25, engineering, taurus]
205083           [male, 16, student, virgo]
635379     [female, 23, student, capricorn]
62141          [female, 16, student, aries]
Name: labels, dtype: object
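
To read the two outputs above side by side, the rows can be paired and compared category by category. The sketch below copies the five predicted and five true rows from the outputs above; the variable names are chosen here for illustration.

```python
# Values copied from the two outputs above (first five validation rows).
predicted = [
    ['male', '17', 'student', 'taurus'],
    ['male', '16', 'indunk', 'libra'],
    ['female', '48', 'marketing', 'aries'],
    ['male', '35', 'student', 'aquarius'],
    ['female', '23', 'indunk', 'taurus'],
]
actual = [
    ['female', '24', 'arts', 'libra'],
    ['female', '25', 'engineering', 'taurus'],
    ['male', '16', 'student', 'virgo'],
    ['female', '23', 'student', 'capricorn'],
    ['female', '16', 'student', 'aries'],
]
categories = ["gender", "age", "topic", "sign"]

matches = []
for true_row, pred_row in zip(actual, predicted):
    # Collect the categories where the prediction equals the true label.
    hits = [c for c, t, p in zip(categories, true_row, pred_row) if t == p]
    matches.append(len(hits))
    print("true:", true_row, "| predicted:", pred_row, "| correct:", hits)
```

For these five rows only two individual labels are predicted correctly (one topic and one gender), and no row is fully correct.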


# EXTRA
Save model

In [25]:
import pickle

modelPrefix = 'model_clf_'
vectorizerPrefix = 'vectors_'
filename = str(exclude) + "exclude_" + str(vectors) + "Vectors_" + str(iterations) + "Iterations.sav"
print(filename)
# filename = "model_clf.sav"
with open(modelPrefix + filename, 'wb') as f:
    pickle.dump(clf, f)
# Save the fitted vectorizer (not the training matrix) so that new text
# can be mapped to the same features when the model is reloaded.
with open(vectorizerPrefix + filename, 'wb') as f:
    pickle.dump(vectorizer, f)

0.8exclude_5000Vectors_300Iterations.sav


In [18]:
# Testing loading the model

# Note: this file name differs from the one saved in the cell above.
with open("model_clf_20PerData_5000Vectors_200Iterations.sav", 'rb') as f:
    loaded_clf = pickle.load(f)

temp, temp_prob = topPredictions(loaded_clf.predict_proba(np.reshape(X_val_vectorized[4], (1, -1)))[0])
print(temp)

['male', '26', 'indunk', 'aries']
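
A reloaded classifier is only usable on new text if the fitted vectorizer is restored with it. The sketch below shows a pickle round trip of a fitted `CountVectorizer`, using an in-memory byte string as a stand-in for a `.sav` file. The toy documents and variable names are assumptions for illustration.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer

# Fit a small vectorizer on hypothetical documents.
vec = CountVectorizer(ngram_range=(1, 2)).fit(["testing testing", "info found pages"])

# Serialize to bytes and restore, as a stand-in for writing and reading a file.
blob = pickle.dumps({"vectorizer": vec})
restored = pickle.loads(blob)["vectorizer"]

# The restored object maps new text to the same feature columns.
new_row = restored.transform(["found testing"])
print(new_row.shape)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.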
