In [180]:
""" SETUP """
# NOTE: this notebook requires Python3 (3.6.1) on the target machine.
# To install Python3 and the additonal nessecary requirements (none for now),
# please view: https://realpython.com/installing-python/
# note that you need to specify the version (3.6.1) at install!

# Getting pip and jupyter 
# $ python3 -m pip install --upgrade pip
# $ python3 -m pip install jupyter

# Check correct python version (should say 3.6.1 at some point)
import sys
print(sys.executable)
print(sys.version)
print(sys.version_info)

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3
3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
sys.version_info(major=3, minor=6, micro=1, releaselevel='final', serial=0)


In [181]:
""" PIP INSTALLS NECCESARY """
# Get scikit-learn and it's required packages (ML package + numerical computing)
# $ pip3 install scikit-learn 
# Get pandas for easy wrangling and data management
# $ pip3 install pandas=0.24.2
""" Package imports """
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF
print("Imports completed successfully!")

Imports completed successfully!


In [182]:
# First, let's load in the data and utilise some predefined cleaning tools, removing tags etc.
raw = fetch_20newsgroups(shuffle=True, random_state=42, remove=['headers','footer','quotes'])
data = raw.data 
print("Document count: ", len(data)) # how many documents do we have?

Document count:  11314


In [183]:
# Let's try to only work with a subset (gotta go fast)

n_samples = 2500 # how many documents we will process

subsetRaw = data[:n_samples]

# let's print out the first 3 samples! 
sub_samples = data[:3]
for counter, sample in enumerate(sub_samples, 1):
    print('Printing Sample: %d' %counter + '...')
    print("")
    print(sample)
    
    

Printing Sample: 1...

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





Printing Sample: 2...

A fair number of brave souls who upgraded their SI clock oscillator have
shared their experiences for this poll. Please send a brief message detailing
your experiences with the procedure. Top speed attained, CPU rated speed,
add on cards and adapters, heat sinks, hour of usage per day, floppy disk
functionality with 800 and 1.4 m floppies are especially requested.

I will be summarizing in the next 

In [184]:
"""
Great! But you can't do math on raw words. How do we turn this stuff into math:able numbers? 
With NLP* ofc!

*natural language processing

The two most common starting points for processing text documents into numbers are:
 
        1: Term Frequency [TF] (a.k.a. Bag-of-words model)
        
        2: TF-IDF (term frequency / inverse document frequency) # See https://en.wikipedia.org/wiki/Tf%E2%80%93idf

For the first part of this tutorial, we're gonna use Term Frequency.
"""
# Load our sample of the full data set. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.


# Vectorizer + NMF params

n_features = 1000 # how many raw "features" or data points per text we want to use
n_components = 12 # how many groups (categories, in this case most ressembling "topics")
n_top_words = 10 # the n most "important" words for each topic

tf_vectorizer = CountVectorizer(
    max_df = 0.95,
    min_df = 10,
    max_features = n_features,
    stop_words = 'english'
)

tf = tf_vectorizer.fit_transform(subsetRaw)
tf.shape # Check dimensions of 2dim-array => Should output: (2500, 1000) => (n_rows, n_columns)

# Let's set up NMF!
# For a quick reminder: https://blog.acolyer.org/2019/02/18/the-why-and-how-of-nonnegative-matrix-factorization/

# Setup /w params
nmf = NMF(
    n_components=n_components,
    random_state=42,
    alpha=0.1,
    l1_ratio=0.5
)

# Apply model to our data and get 2 resulting matrices (W, H)
nmf.fit(tf)
# Get matrices
W = nmf.transform(tf)
H = nmf.components_

print("Dims of W: %s" %str(W.shape))
print("Dims of H: %s" %str(H.shape))

# Let's define a function to print the n_top_words in every category
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

# Let's print the top words behind every category/topic (i.e. the highest weighted words behind each topic)
feature_names = tf_vectorizer.get_feature_names()
print_top_words(nmf, feature_names, n_top_words)


Dims of W: (2500, 12)
Dims of H: (12, 1000)
Topic #0: max 145 04 tm 14 45 au 75 34 mr
Topic #1: available edu motif export version widget ftp based com pub
Topic #2: people said don know didn went just armenians say like
Topic #3: output file entry program stream null return line int open
Topic #4: internet anonymous privacy email information use mail address file users
Topic #5: turkish jews turkey nazis jewish war book history nazi university
Topic #6: jesus matthew people said david king lord israel course does
Topic #7: mv 17 27 sp ah 24 mw 145 mt 37
Topic #8: edu os com comp cs ca yes john org ibm
Topic #9: tape sys use windows scsi drive driver drivers disk problem
Topic #10: 00 dos 15 20 good 50 25 excellent missing 10
Topic #11: version machines type contact comments edu ftp pc anonymous available



In [185]:
""" 

Okay, a little weird but fine. See instructor for explanation.

"""

# Now we have condensed feature representations for every article we can start comparing them. For comparison,
# we're going to employ Cos.Sim. and Euclidean Distance. We will also see the effect on our feature extraction!

from sklearn.metrics.pairwise import euclidean_distances, cosine_distances

# Via Euclidean on features
euclideanDistanceMatrix = euclidean_distances(tf)
print("Check Dims of Euc Matrix: ")
print(euclideanDistanceMatrix.shape)

# Via Cosim on raw features
cosDistanceMatrix = cosine_distances(tf)
print("Check Dims of Cos Matrix: ")
print(cosDistanceMatrix.shape)

# What should the dimensions be? How many samples did we use?
# Let's check that we actually get a symmetric matrix for both:
def check_symmetric(a, rtol=1e-05, atol=1e-08):
    return np.allclose(a, a.T, rtol=rtol, atol=atol)

for matrix in [euclideanDistanceMatrix, cosDistanceMatrix]:
    print("Matrix symmetric: %s" %check_symmetric(matrix))

# Once we have this, let's query a document at random and find the 3 most similar ones.

queryIndex = 255 # just any ol' postive integer (watch out for out of index, though!)
# let's check what we got!
subsetRaw[255]

# Great! Let's pair this text up with a similar one and see if cos/euc yield different results

# Before we get to the sorting step, we're going to utilise Pandas' dataframes to better keep track of our indexes; 
# this will ensure that we do not accidently mix up elements. 
# Btw, indexing starts from 0 in Python. Don't let matlab-people ruin your day.

# Generate some more read:able names.
featureNames = []
for i in range(0,n_samples):
    featureNames.append('doc_%d' %i)
# print(featureNames)

# Create Dataframes for each matrix:
eucDf = pd.DataFrame(euclideanDistanceMatrix, columns=featureNames, index=featureNames)
cosDf = pd.DataFrame(cosDistanceMatrix, columns=featureNames, index=featureNames)

# Want to check what's cooking?
# print(eucDf.head(3))
# print(cosDf.head(3))
# Notice something along the diagonal? Why is that?

def getMostSimilarTextIndexes(index, df, n):
    """ Retrieves the n most similar items' indexes and returns a list of indexes in order of similarity """
    listOfIndexes = []
    doc = 'doc_%d' %index
    n_most_similar_docs = df.nsmallest(n, doc, keep='all').index.tolist()
    return n_most_similar_docs[1:]

# Let's Compare and contrast!
print(getMostSimilarTextIndexes(queryIndex, eucDf, 10))
print(getMostSimilarTextIndexes(queryIndex, cosDf, 10))


Check Dims of Euc Matrix: 
(2500, 2500)
Check Dims of Cos Matrix: 
(2500, 2500)
Matrix symmetric: True
Matrix symmetric: True
['doc_698', 'doc_235', 'doc_708', 'doc_58', 'doc_2454', 'doc_1065', 'doc_493', 'doc_1841', 'doc_2302']
['doc_1704', 'doc_58', 'doc_1116', 'doc_698', 'doc_708', 'doc_235', 'doc_2140', 'doc_1308', 'doc_2334']


In [186]:
# Notice how using Euclidean distance and cosine difference yielded similar but ultimately different results. y though?
# Also notice the first result is the document we queried. 
# After all, the most similar thing to a thing, is the thing itself.

"""
Now with some actual results, let's check if we agree with the computer's decision. Let's get the texts and compare.

"""

eucRes = getMostSimilarTextIndexes(queryIndex, eucDf, 3)
cosRes = getMostSimilarTextIndexes(queryIndex, cosDf, 3)

def comparePrint(res, sourceDocIndex):
    print("Printing the text we want to find friends to: ")
    print(subsetRaw[sourceDocIndex])
    print("")
    for counter, doc_id in enumerate(res, 1):
        print('-'*50)
        print("Printing number %d Most Similar" % counter)
        print(subsetRaw[int(doc_id[4:])])
    return "Done!"

print("----------------------------------- COMPARING EUC -----------------------------------")
comparePrint(eucRes, queryIndex)
print("-------------------------------------------------------------------------------------")

print("----------------------------------- COMPARING COS -----------------------------------")
comparePrint(eucRes, queryIndex)
print("-------------------------------------------------------------------------------------")


    





----------------------------------- COMPARING EUC -----------------------------------
Printing the text we want to find friends to: 

No. When the program is run, it loads 4 configuration files; autoexec.bat,
config.sys, win.ini, and system.ini. There is no Open entry on the File
menu. You can only edit these four files. If you need to edit some other
program's .ini file, use Notepad or some other ASCII editor.

I wonder whether Microsoft intended for sysedit to be used, or if it was
just a holdover from the testing period and they forgot to take it out. The
reason I think this is because there is absolutely no mention in the manuals
about this program, and there is no online help for it (just an About entry
under the File menu). The program looks like something that was intended for
internal use only. It's kind of a shame, though. It would have made a nice
multi-file replacement for Notepad.

Daniel Silevitch                           dmsilev@athena.mit.edu
Massachusetts Institute of 

This has been the tutorial!

Please take some time and go through each part to make sure you understand the gist. Please do play around
with parameters and see if you can achieve better result in the comparisons of text!

Arguably, the most important step in the whole comparsion process is the preproccessing. You can try to play around
with other vectorizers (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text).

Additionally, you could try to define your own preprocessing steps. Like we saw in our applied NMF, a lot of artifacts
of 2-character sequences get thrown in the mix. Perhaps you could try to cut away 2-character 'words' before applying
vectorization (HINT: Regular expressions!!!).

If you want something really challenging, try using the vectorized text and build a classifier to predict the group.
The dataset contains labels, but you'll need to load those in along with the rest! 

If you have any questions regarding the tutorial, please email:
erik.hakansson96@gmail.com