# Picking up where we left off

I just extracted the body text and title text. Let's combine them into my bag of words and calculate each term's importance using the following formula:

$w_{x,y} = {tf}_{x,y} \times \log\left(\frac{N}{df_x}\right)$

Where:

${tf}_{x,y}$ is the frequency of term x in document y, <br>
$df_x$ is the number of documents containing term x, and <br>
N is the total number of documents.
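
As a quick sanity check on the formula, here is a minimal sketch that applies it by hand to a made-up three-document corpus (the tokens and the helper name `tfidf_weights` are illustrative assumptions, not part of the competition data). Note that scikit-learn's `TfidfVectorizer` with default settings uses a smoothed IDF and L2-normalises each row, so its numbers will not match this textbook version exactly.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration; each document is a list of tokens.
docs = [
    ["gene", "therapy", "trial"],
    ["gene", "expression", "gene"],
    ["conference", "schedule"],
]

N = len(docs)  # total number of documents

# df[x]: number of documents that contain term x at least once.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_weights(doc):
    """Return {term: tf * log(N / df)} for one tokenised document."""
    tf = Counter(doc)  # raw count of each term in this document
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

weights = [tfidf_weights(doc) for doc in docs]
print(weights[1])
```

In the second document, "gene" appears twice but also occurs in two of the three documents, so its weight is 2 × log(3/2); "expression" appears once but only in this document, so its weight is log(3), which is larger.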

In [17]:
import pandas as pd
import numpy as np

data_dir = "../data/2018-08-10_AV_Innoplexus/"

train_df_no_html = pd.read_csv(data_dir+'train_with_tokens_no_html.csv')

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input='content', analyzer='word')

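# Join title and body with a space so the last title token does not fuse with the first body token.
documents = train_df_no_html['title_tokens'] + ' ' + train_df_no_html['body_tokens']
tfidf = vectorizer.fit_transform(documents)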

In [45]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=500, n_iter=5, random_state=27)

svd_array = svd.fit_transform(tfidf)

# How many components is enough?
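To decide, I can look at how much each component contributes to my SVD reduction. Strictly speaking, `svd.singular_values_` holds singular values rather than eigenvalues. I divide each singular value by the sum of the singular values and look for where the diminishing returns begin, visually. Where the line tapers off, that is my n. Keep in mind that the sum covers only the 500 components that were kept, so these are shares relative to each other, not shares of the full TF-IDF matrix.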
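
A common alternative to eyeballing the curve is to pick the smallest number of components whose cumulative share passes a threshold. The sketch below uses plain NumPy on random stand-in data (the matrix `X`, the 0.90 threshold, and the helper `components_needed` are all assumptions for illustration). It uses squared singular values, since those are what relate to the energy captured by each component; on a fitted `TruncatedSVD`, the `explained_variance_ratio_` attribute offers a comparable per-component measure.

```python
import numpy as np

rng = np.random.default_rng(27)
# Stand-in for the TF-IDF matrix: 200 "documents" x 50 "terms" (random, illustration only).
X = rng.random((200, 50))

# Singular values of X, returned in descending order.
s = np.linalg.svd(X, compute_uv=False)

# Share of the total squared singular values per component, and the running total.
share = s**2 / np.sum(s**2)
cumulative = np.cumsum(share)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative share reaches the threshold."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))  # guard against floating-point shortfall at the end

k = components_needed(cumulative, 0.90)
print(k, cumulative[k - 1])
```

The threshold is a judgment call in the same way the visual elbow is; the benefit is only that the choice becomes explicit and repeatable.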

In [46]:
print(svd_array.shape)
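# Note: these are singular values (one per retained component), not eigenvalues.
eigens = svd.singular_values_
eigens_sum = eigens.sum()
# Share of each singular value relative to the sum over the 500 retained components.
eigens_div = np.divide(eigens, eigens_sum)
eigens_div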

(53447, 500)


array([0.01119896, 0.00688249, 0.00652275, 0.00580999, 0.00549637,
       0.00536106, 0.00520898, 0.00510423, 0.00495349, 0.0047605 ,
       0.00470081, 0.00457352, 0.00451122, 0.00445186, 0.004379  ,
       0.00427077, 0.00425519, 0.00423837, 0.00415412, 0.00412722,
       0.00408633, 0.00400814, 0.00392442, 0.00388354, 0.00382   ,
       0.0037424 , 0.00363517, 0.0036078 , 0.00356702, 0.00355378,
       0.00353542, 0.00351471, 0.00349797, 0.00349387, 0.00347682,
       0.00343113, 0.00339749, 0.00337173, 0.00333538, 0.00332126,
       0.00328911, 0.00327386, 0.00323721, 0.003231  , 0.00316856,
       0.0031474 , 0.00312513, 0.00311362, 0.00308652, 0.00305711,
       0.00304928, 0.00303995, 0.00302463, 0.00299409, 0.00298726,
       0.00298292, 0.0029544 , 0.00293562, 0.00292653, 0.00291805,
       0.00289622, 0.00288737, 0.00287636, 0.00287096, 0.00284437,
       0.00283045, 0.00280566, 0.00279687, 0.00278603, 0.00277753,
       0.00276158, 0.00275277, 0.0027382 , 0.00271963, 0.00270

In [47]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
# Load plotly.js from the CDN so the figure renders inline in the notebook.
init_notebook_mode(connected=True)

svd_data = go.Scatter(
    x = np.arange(eigens_div.shape[0]),
    y = eigens_div.tolist(),
)

layout = go.Layout(
    title="SVD Decomp Diminishing Returns"
)

fig = go.Figure(data=[svd_data], layout=layout)

iplot(fig)

Judging by the plot, I would say 500 components is a fairly safe place to stop.

In [66]:
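# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
vector_features = vectorizer.get_feature_names_out()
# Label the 500 columns with the 500 highest-weighted terms of the first component.
# These are labels only: each column is a latent SVD component, not a single term.
eigen_features = [vector_features[i] for i in svd.components_[0].argsort()[::-1]][:500]

svd_df = pd.DataFrame(svd_array, columns=eigen_features)
svd_df['Tag'] = train_df_no_html['Tag']
svd_df.head()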
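
Each column of the SVD output is a latent component that mixes many terms, so a single word is only a rough label for it. One way to get a feel for what a component represents is to list its highest-loading terms. The sketch below does that on a tiny made-up example; `vocab` and `components` are hypothetical stand-ins for `vectorizer.get_feature_names_out()` and `svd.components_`, and `top_terms` is an illustrative helper, not something from the notebook.

```python
import numpy as np

# Hypothetical stand-ins for the fitted vocabulary and svd.components_.
vocab = np.array(["gene", "trial", "schedule", "venue", "protein"])
components = np.array([
    [0.70, 0.10, 0.05, 0.02, 0.69],   # component 0
    [0.03, 0.08, 0.75, 0.65, -0.01],  # component 1
])

def top_terms(components, vocab, n=2):
    """For each component, list the n terms with the largest absolute loading."""
    # Sort each row by descending absolute loading and keep the first n indices.
    order = np.argsort(-np.abs(components), axis=1)[:, :n]
    return [vocab[row].tolist() for row in order]

print(top_terms(components, vocab))
```

Absolute values are used because a strongly negative loading is as informative as a strongly positive one; the sign of an SVD component is arbitrary.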

In [67]:
svd_df.to_csv(data_dir+'train_df_tfidf.csv',index=False)