# How to process text for ML algorithms

Raw text cannot be used directly as input to ML models.

We need some way of representing text in a numerical format so that a model can work with it.

Typically, sentences are broken down into tokens (words).

Each token is then encoded as an integer or floating-point value for use as input to a machine learning algorithm. This is called **feature extraction**.
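
As a rough illustration of tokenizing and integer-encoding, here is a minimal sketch using only the standard library. The helper names `tokenize` and `build_vocabulary` are made up for this example, and the regex-based splitting is an assumption; real tokenizers handle punctuation, contractions and Unicode more carefully.

```python
import re

def tokenize(sentence):
    # Lowercase the sentence, then keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())

def build_vocabulary(token_lists):
    # Give each distinct token the next free integer id, in order of first appearance.
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

sentences = ["How old are you now?", "Thank you very much."]
token_lists = [tokenize(s) for s in sentences]
vocabulary = build_vocabulary(token_lists)

# Replace every token by its integer id.
encoded = [[vocabulary[t] for t in tokens] for tokens in token_lists]

print(token_lists)
print(encoded)
```

Note that "you" receives the same id in both sentences, which is what lets a model relate the two documents.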

### Bag-of-Words Model (BoW)

BoW is a widely used model for representing text documents.

This model considers only which words occur in a document and ignores their order.

The encoding records which words are present and how often each one appears.


In [1]:
Vocabulary = {'how':0, 'old':1, 'thank':2, 'are':3, 'you':4, 'very':5, 'now':6, 'much':7}

doc1 = "how old are you now"
doc2 = "thank you very much"

doc1_tokens = ["how", "old", "are", "you", "now"]
doc2_tokens = ["thank", "you", "very", "much"]

docs = [ [1, 1, 0, 1, 1, 0, 1, 0], 
         [0, 0, 1, 0, 1, 1, 0, 1] ]
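
The vectors in the cell above were written out by hand. A minimal sketch of how they could be computed from the same vocabulary follows; the function name `bag_of_words` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

vocabulary = {'how': 0, 'old': 1, 'thank': 2, 'are': 3,
              'you': 4, 'very': 5, 'now': 6, 'much': 7}

def bag_of_words(tokens, vocabulary):
    # One slot per vocabulary word; each slot holds that word's count in the document.
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]
    return vector

doc1_tokens = "how old are you now".split()
doc2_tokens = "thank you very much".split()

docs = [bag_of_words(doc1_tokens, vocabulary),
        bag_of_words(doc2_tokens, vocabulary)]
print(docs)

# A repeated word raises its count, and word order has no effect.
repeated = bag_of_words("you you now".split(), vocabulary)
print(repeated)
```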

The scikit-learn library provides three vectorizers for building a BoW representation: `CountVectorizer`, `TfidfVectorizer` and `HashingVectorizer`.

#### CountVectorizer

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())

{u'brown': 0, u'lazy': 4, u'jumped': 3, u'over': 5, u'fox': 2, u'dog': 1, u'quick': 6, u'the': 7}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]


In [3]:
text2 = ["the fox is brown"]
vector = vectorizer.transform(text2)
print(vector.toarray())

[[1 0 1 0 0 0 0 1]]


#### TfidfVectorizer

Some words occur too frequently!

For example: a, an, the, is, for, to ...

However, these words add little value to the content of the text documents.

To address this, TF-IDF is widely used.

TF-IDF is an acronym for **Term Frequency - Inverse Document Frequency**.

**Term Frequency**: This summarizes how often a given word appears within a document.

**Inverse Document Frequency**: This downscales words that appear a lot across documents.

TF-IDF is a score assigned to each word that rewards frequency within a document while discounting words that are common across documents.

TF(t) = Number of times term t appears in a document / Total number of terms in the document

IDF(t) = ln(Total number of documents / Number of documents containing term t)

TF-IDF(t) = TF(t) * IDF(t)
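
The formulas above can be worked through by hand. The sketch below does so in plain Python for the three short documents used in the next cell. It also shows the variant that `TfidfVectorizer` applies with its default settings, which is why the library's numbers differ from the textbook formulas: the raw count is used as TF, the IDF is smoothed as `ln((1 + N) / (1 + df(t))) + 1`, and each document vector is L2-normalised. The helper names are made up for this example.

```python
import math

docs = [
    "the quick brown fox jumped over the lazy dog".split(),
    "brown dog".split(),
    "lazy fox".split(),
]
n_docs = len(docs)

def df(term):
    # Number of documents that contain the term.
    return sum(1 for doc in docs if term in doc)

def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term):
    # IDF exactly as defined above.
    return math.log(n_docs / df(term))

def smoothed_idf(term):
    # Variant used by TfidfVectorizer's defaults (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + df(term))) + 1

# Textbook TF-IDF of "the" in the first document: (2/9) * ln(3/1)
tfidf_the = tf("the", docs[0]) * idf("the")
print(round(tfidf_the, 4))

# scikit-learn style: raw count * smoothed IDF, then L2-normalise the row.
terms = sorted(set(docs[0]))
weights = [docs[0].count(t) * smoothed_idf(t) for t in terms]
norm = math.sqrt(sum(w * w for w in weights))
sklearn_style = {t: w / norm for t, w in zip(terms, weights)}
print(round(smoothed_idf("brown"), 8), round(sklearn_style["the"], 8))
```

The smoothed values agree with the `idf_` array and the encoded vector printed by the `TfidfVectorizer` cell.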

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "brown dog",
        "lazy fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())

{u'brown': 0, u'lazy': 4, u'jumped': 3, u'over': 5, u'fox': 2, u'dog': 1, u'quick': 6, u'the': 7}
[1.28768207 1.28768207 1.28768207 1.69314718 1.28768207 1.69314718
 1.69314718 1.69314718]
(1, 8)
[[0.24920411 0.24920411 0.24920411 0.32767345 0.24920411 0.32767345
  0.32767345 0.65534691]]
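
The third scikit-learn text vectorizer, `HashingVectorizer`, is not demonstrated in this notebook. The sketch below is an assumption about how it would fit in here: it hashes each token to a column index instead of storing a vocabulary, so no `fit` step is needed. The value `n_features=20` is an arbitrary small size chosen only to keep the printed vector readable.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

# create the transform; there is no vocabulary to learn, so fit() is not required
vectorizer = HashingVectorizer(n_features=20)

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())
```

Because the mapping is a hash, different words can collide in the same column and the original words cannot be recovered from the column indices.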


## Salary prediction given job description and location

In [13]:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Ridge 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.sparse import hstack
from sklearn.metrics import mean_squared_error



In [17]:
train_file = 'salary_dataset/salary-train.csv'
test_file = 'salary_dataset/salary-test-mini.csv'

In [18]:
def get_data(filename):
    df = pandas.read_csv(filename)
    # First, convert df['FullDescription'] to lowercase with .str.lower(),
    # then replace everything except letters and digits with spaces.
    # This makes it easier to split the text into words later.
    df['FullDescription'] = df['FullDescription'].str.lower().replace('[^a-zA-Z0-9]', ' ', regex=True)

    # Replace NaN in the LocationNormalized and ContractTime columns with the special string 'nan'.
    # Assigning the result back avoids fillna(inplace=True) on a column selection,
    # which may not modify df in recent pandas versions.
    df['LocationNormalized'] = df['LocationNormalized'].fillna('nan')
    df['ContractTime'] = df['ContractTime'].fillna('nan')
    return df
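
To see what the cleaning steps in `get_data` do, here is a small sketch on a made-up two-row frame. The sample values are invented for illustration, and `.str.replace` is used here as an assumed equivalent of the `.replace(..., regex=True)` call above.

```python
import pandas

# Hypothetical sample rows, not taken from the salary dataset.
sample = pandas.DataFrame({
    'FullDescription': ['Senior C++ Developer (London)!', 'Part-time nurse, 20k'],
    'LocationNormalized': ['London', None],
})

# Lowercase, then turn every character that is not a letter or digit into a space.
cleaned = sample['FullDescription'].str.lower().str.replace(r'[^a-z0-9]', ' ', regex=True)

# Missing locations become the literal string 'nan'.
location = sample['LocationNormalized'].fillna('nan')

print(cleaned[0].split())
print(cleaned[1].split())
print(location.tolist())
```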

In [19]:
train = get_data(train_file)
train = train[0:1000]
train.head()

Unnamed: 0,FullDescription,LocationNormalized,ContractTime,SalaryNormalized
0,international sales manager london k ...,London,permanent,33000
1,an ideal opportunity for an individual that ha...,London,permanent,50000
2,online content and brand manager luxury reta...,South East London,permanent,40000
3,a great local marketleader is seeking a perman...,Dereham,permanent,22500
4,registered nurse rgn nursing home for young...,Sutton Coldfield,,20355


In [20]:
test = get_data(test_file)
test.head()

Unnamed: 0,FullDescription,LocationNormalized,ContractTime,SalaryNormalized
0,we currently have a vacancy for an hr project ...,Milton Keynes,contract,
1,a web developer opportunity has arisen with an...,Manchester,permanent,


In [52]:
# Convert the collection of raw documents to a matrix of TF-IDF features with TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5)
X_train_tfidf = vectorizer.fit_transform(train['FullDescription'])
print(X_train_tfidf)

# Convert the LocationNormalized and ContractTime features of all records to a list of dictionaries
# E.g.: [ {'LocationNormalized': 'London', 'ContractTime': 'permanent'},
#         {'LocationNormalized': 'London', 'ContractTime': 'permanent'}, ... ]
features = train[['LocationNormalized', 'ContractTime']].to_dict('records')
print(features)

# Transform the list of feature-value mappings into one-hot encoded vectors
enc = DictVectorizer()
X_train_categ = enc.fit_transform(features)
print(X_train_categ[0:5])

# Stack the TF-IDF and categorical matrices horizontally into a single feature matrix.
# Both matrices are sparse (most of their elements are zero), so use
# scipy.sparse.hstack rather than numpy.hstack.
X_train = hstack([X_train_tfidf, X_train_categ])

  (0, 1559)	0.21428189918183788
  (0, 2621)	0.3262186171735005
  (0, 1773)	0.05109050060808638
  (0, 1733)	0.0674403518838804
  (0, 3088)	0.05631339134961835
  (0, 586)	0.04792028516624542
  (0, 872)	0.10044292949442259
  (0, 1793)	0.2484789762672071
  (0, 2145)	0.1091046487989682
  (0, 1388)	0.13816061799854906
  (0, 182)	0.04680861576769975
  (0, 603)	0.023805149789111916
  (0, 195)	0.2812961389127113
  (0, 2010)	0.030345321008876904
  (0, 1989)	0.05122315324431446
  (0, 2987)	0.15017530372464244
  (0, 1654)	0.058336197919806605
  (0, 1470)	0.13064807656373886
  (0, 1083)	0.057797945188267405
  (0, 1792)	0.03446327365299464
  (0, 237)	0.047645613659024685
  (0, 2669)	0.03918283276395212
  (0, 1102)	0.04679776914655783
  (0, 428)	0.06496764158160212
  (0, 858)	0.02539936067578591
  :	:
  (999, 2910)	0.07853165620880553
  (999, 1562)	0.0930128747539569
  (999, 593)	0.08987680633066734
  (999, 2447)	0.07633600153139086
  (999, 1100)	0.08067617448845865
  (999, 777)	0.14617196529581047
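
To make the `DictVectorizer` and `hstack` steps concrete, here is a small sketch on two hand-written records. The records and the stand-in text matrix are invented for illustration and are not rows from the dataset.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical records in the same shape as to_dict('records') produces.
records = [{'LocationNormalized': 'London', 'ContractTime': 'permanent'},
           {'LocationNormalized': 'Dereham', 'ContractTime': 'nan'}]

# String values are one-hot encoded as "<column>=<value>" features.
enc = DictVectorizer()
X_categ = enc.fit_transform(records)
print(enc.feature_names_)
print(X_categ.toarray())

# Stand-in for a TF-IDF block: 2 documents x 3 terms with made-up weights.
X_text = csr_matrix([[0.5, 0.0, 0.5],
                     [0.0, 1.0, 0.0]])

# Columns are concatenated; the row count must match.
X = hstack([X_text, X_categ])
print(X.shape)
```

Each row of the stacked matrix holds the text features followed by the one-hot location and contract features for the same record.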
 

In [27]:
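# Model: SalaryNormalized is a continuous value, so this is a regression problem.
# The output below comes from the Ridge model. DecisionTreeClassifier and
# RandomForestClassifier expect discrete class labels; their regression
# counterparts are DecisionTreeRegressor and RandomForestRegressor.
classifier = Ridge(alpha=0.1)
#classifier = LinearRegression()

# The target value (the quantity the algorithm has to predict) is SalaryNormalized
y = train['SalaryNormalized']

# Train the model on the data
classifier.fit(X_train, y)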

Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [38]:
# Build the test feature matrix with the vectorizers already fitted on the training data
X_test_tfidf = vectorizer.transform(test['FullDescription'])
X_test_categ = enc.transform(test[['LocationNormalized', 'ContractTime']].to_dict('records'))
X_test = hstack([X_test_tfidf, X_test_categ])


                                     FullDescription LocationNormalized  \
0  we currently have a vacancy for an hr project ...      Milton Keynes   
1  a web developer opportunity has arisen with an...         Manchester   

  ContractTime  SalaryNormalized  
0     contract               NaN  
1    permanent               NaN  


In [29]:
result = classifier.predict(X_test)

In [30]:
print(result)

[44724.78444727 46726.26679802]


In [31]:
# Ground-truth values were obtained from an online GitHub repo
ground_truth = [56541.76, 37190.92]

# mean_squared_error returns the MSE; take its square root to get the RMSE
mse = mean_squared_error(ground_truth, result)
print(mse)

115281874.88608739
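
`mean_squared_error` returns the mean of the squared errors, so the value printed above is in squared salary units. A minimal sketch of both MSE and RMSE in plain Python follows, using the two predictions and ground-truth values shown in this notebook; the RMSE is easier to read because it is in the same units as the salary.

```python
import math

ground_truth = [56541.76, 37190.92]
predicted = [44724.78444727, 46726.26679802]

# Mean squared error: average of the squared differences.
squared_errors = [(t - p) ** 2 for t, p in zip(ground_truth, predicted)]
mse = sum(squared_errors) / len(squared_errors)

# Root mean squared error: square root of the MSE, in salary units.
rmse = math.sqrt(mse)

print(mse)
print(rmse)
```

An RMSE of roughly 10,737 means the two predictions are off by about that amount on average, which is a more interpretable figure than the MSE.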
