In [6]:
#from task2a_preprocessing import Preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import pandas as pd

In [7]:
# read in dataframe from csv
df_raw = pd.read_csv('results_scrapping.csv')

## TF-IDF Vectorization
The goal is to create a document-term matrix that contains the tf-idf value of each word within each document. A high tf-idf score marks a word that appears often in a document but rarely in the rest of the corpus, which makes it likely to be useful for document classification. Words that appear often in a document but also often across the corpus get a low tf-idf score.

In [9]:
# generate tf-idf matrix vectorizer
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True, smooth_idf=True)

## Singular Value Decomposition (SVD) for dimensionality reduction
The resulting document-term matrix is huge and contains a lot of noisy and redundant information. Therefore, we reduce its dimensions to a few latent topics that capture the relationships among the words and documents.
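
One possible way to judge how much information a given number of components retains is the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. The sketch below uses a hypothetical six-document corpus and `n_components=3`, both chosen only for illustration, and fixes `random_state` so repeated runs give the same numbers.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration only
toy_docs = [
    "stock markets fell as investors sold shares",
    "investors bought shares as markets recovered",
    "the team won the football match",
    "football fans cheered the winning team",
    "new phone released with faster chip",
    "chip maker announces faster phone processor",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Reduce the sparse tf-idf matrix to 3 latent dimensions
toy_svd = TruncatedSVD(n_components=3, algorithm="randomized", random_state=0)
toy_reduced = toy_svd.fit_transform(toy_tfidf)

# Share of variance captured by each component, and the running total
ratio = toy_svd.explained_variance_ratio_
cumulative = np.cumsum(ratio)
print("documents x components:", toy_reduced.shape)
print("cumulative explained variance:", cumulative)
```

Plotting the cumulative values for increasing `n_components` is one heuristic for picking the number of topics. It is a rough guide rather than a definitive criterion.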

In [11]:
# generate svd model
# n_components represents the number of topics
svd_model = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=10)

## Build Pipeline with tf-idf vectorization and Singular Value Decomposition

In [12]:
# build pipeline with tf-idf vectorizer and svd model
svd_transformer = Pipeline([('tfidf', vectorizer), ('svd', svd_model)])

In [13]:
svd_matrix = svd_transformer.fit_transform(df_raw['Content'])
svd_matrix

array([[ 1.42548101e-01, -2.60729332e-03, -1.52210595e-02,
         3.03336260e-02, -1.27723027e-01,  1.89021619e-01,
         7.57410543e-02, -1.42134710e-01,  5.78917237e-02,
         1.80433048e-01],
       [ 1.58755866e-01,  4.02091261e-03, -2.17624589e-02,
         2.85979852e-02, -5.40824958e-02,  8.38432350e-02,
         1.30499727e-01, -3.35990182e-02,  7.33855172e-02,
         9.95617531e-04],
       [ 1.12444230e-01,  1.67944657e-01, -1.18248756e-02,
         7.90589143e-03, -4.97136671e-02,  1.08539963e-01,
        -5.61801699e-02,  7.74257901e-02,  2.96462359e-01,
        -6.90240057e-02],
       [ 1.74237367e-01,  1.42115128e-01,  1.22214876e-02,
         6.25551567e-02, -1.56612879e-01,  2.78450766e-01,
         8.19978335e-02,  3.07760587e-02,  1.96382623e-01,
         1.79546237e-01],
       [ 1.36579119e-01, -2.82160602e-03, -2.34583281e-02,
         2.26345298e-02, -4.00517570e-02,  1.60006540e-02,
         2.08856487e-02,  1.03386716e-01, -1.59172338e-02,
        -1.

## Topic extraction
The matrix holds one score per document and topic: each row is a document and each column is a topic.
Todo:
- Find corresponding topics for each number
  - might be difficult since we don't even know if there is a word for each topic
  - maybe find words that define each topic from tf-idf matrix
- figure out how many topics we want