# What's Here?

A small, easy-to-use tutorial that introduces the concepts needed for the BigData assignment.

# Environment Preparation

On your local machine, we suggest that you use:

1. Pyenv for installing Python 3.8.
2. Virtualenv for managing a virtual environment.
3. PyCharm for writing your solutions.

# Section 1: Create Example CSV Data

We use the 20 Newsgroups dataset.


In [None]:
# For Tutorial Use Only!

from sklearn.datasets import fetch_20newsgroups
import pandas as pd

def twenty_newsgroup_to_csv():
    newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

    df = pd.DataFrame([newsgroups_train.data, newsgroups_train.target.tolist()]).T
    df.columns = ['text', 'target']

    targets = pd.DataFrame( newsgroups_train.target_names)
    targets.columns=['title']

    out = pd.merge(df, targets, left_on='target', right_index=True)
    out['date'] = pd.to_datetime('now')
    out.to_csv('20_newsgroup.csv')

twenty_newsgroup_to_csv()



# Load the data using Pandas

We will use the pd.read_csv method.

In [None]:
import pandas as pd

news = pd.read_csv('20_newsgroup.csv', index_col=0, sep=',')

In [None]:
news

Unnamed: 0,text,target,title,date
0,I was wondering if anyone out there could enli...,7,rec.autos,2023-12-08 07:53:35.838681
17,I recently posted an article asking what kind ...,7,rec.autos,2023-12-08 07:53:35.838681
29,\nIt depends on your priorities. A lot of peo...,7,rec.autos,2023-12-08 07:53:35.838681
56,an excellent automatic can be found in the sub...,7,rec.autos,2023-12-08 07:53:35.838681
64,: Ford and his automobile. I need information...,7,rec.autos,2023-12-08 07:53:35.838681
...,...,...,...,...
11210,Secrecy in Clipper Chip\n\nThe serial number o...,11,sci.crypt,2023-12-08 07:53:35.838681
11217,Hi !\n\nI am interested in the source of FEAL ...,11,sci.crypt,2023-12-08 07:53:35.838681
11243,"The actual algorithm is classified, however, t...",11,sci.crypt,2023-12-08 07:53:35.838681
11254,\n\tThis appears to be generic calling upon th...,11,sci.crypt,2023-12-08 07:53:35.838681


# Meet the data

Each row contains a text, a numeric target, and a title (the name of the newsgroup).
We need to predict the correct target for each text.

In [None]:
# How to install a package
! pip install spacy
! pip install transformers
! pip install nltk
! pip install torch



In [None]:
# Use a python module
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading htt

# Data Frame Operations

* Select some rows based on filters
* Project some of the fields
* Use multiple conditions

In [None]:
# Print first text
print(news.iloc[0]['text'])

# Print first label
print(news.iloc[0]['target'])

# Conditional Selection
# Print all news that have as label 1
print(news[news['target'] == 1]['title'])

# Multiple Filtering
# Print all news that belong to the label 1 or label 2.
print(news[(news['target'] == 1) | (news['target'] == 2)]['title'])

# Alternative using the isin method
print(news[news['target'].isin([1,2])]['title'])

# Drop the date column
news.drop(['date'], axis=1, inplace=True)

# Drop all rows where the text is exactly the same
news.drop_duplicates(subset=['text'], inplace=True)

# Drop all rows where the text is missing
news.dropna(subset=['text'], inplace=True)

# Keep only rows whose text is longer than 5 characters;
# .copy() avoids a SettingWithCopyWarning when adding columns later
news = news[news['text'].str.len() > 5].copy()

# Find documents that contain the term "computer"
pc_news = news[news['text'].str.contains('computer')]

# Split the text into tokens on whitespace
news['tokens'] = news['text'].str.split(r'\s+')

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
7
3        comp.graphics
16       comp.graphics
25       comp.graphics
82       comp.graphics
87       comp.graphics
             ...      
11235    comp.graphics
11240    comp.graphics
11249    comp.graphics
11289    comp.graphics
11312    comp.graphics
Name: title, Length: 584, dtype: object
3                  comp.graphics
16                 comp.graphics
25                 comp.graphics
82                 comp.graphics
87                 comp.graphics
                  ...           
11228    comp.os.ms-windows.mis



In [None]:
news

Unnamed: 0,text,target,title,tokens
0,I was wondering if anyone out there could enli...,7,rec.autos,"[I, was, wondering, if, anyone, out, there, co..."
17,I recently posted an article asking what kind ...,7,rec.autos,"[I, recently, posted, an, article, asking, wha..."
29,\nIt depends on your priorities. A lot of peo...,7,rec.autos,"[, It, depends, on, your, priorities., A, lot,..."
56,an excellent automatic can be found in the sub...,7,rec.autos,"[an, excellent, automatic, can, be, found, in,..."
64,: Ford and his automobile. I need information...,7,rec.autos,"[:, Ford, and, his, automobile., I, need, info..."
...,...,...,...,...
11210,Secrecy in Clipper Chip\n\nThe serial number o...,11,sci.crypt,"[Secrecy, in, Clipper, Chip, The, serial, numb..."
11217,Hi !\n\nI am interested in the source of FEAL ...,11,sci.crypt,"[Hi, !, I, am, interested, in, the, source, of..."
11243,"The actual algorithm is classified, however, t...",11,sci.crypt,"[The, actual, algorithm, is, classified,, howe..."
11254,\n\tThis appears to be generic calling upon th...,11,sci.crypt,"[, This, appears, to, be, generic, calling, up..."


In [None]:
# Find the number of tokens
news['tokens_number'] = news['tokens'].str.len()

# Create a Train Set and a Test Set

In [None]:
from sklearn.model_selection import train_test_split

# X is the feature matrix
# y is the labels

X = news.drop(['target','title'], axis=1)
y = news['title']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Print the shape of the train set
print(X_train.shape)

# Print the shape of the test set
print(X_test.shape)

(8781, 2)
(2196, 2)


In [None]:
news

Unnamed: 0,text,target,title,tokens
17,I recently posted an article asking what kind ...,7,rec.autos,"[I, recently, posted, an, article, asking, wha..."
77,"I have a 1986 Acura Integra 5 speed with 95,00...",7,rec.autos,"[I, have, a, 1986, Acura, Integra, 5, speed, w..."
2554,"\n\n*nnnnnnnng* Thank you for playing, I canno...",7,rec.autos,"[, *nnnnnnnng*, Thank, you, for, playing,, I, ..."
4335,Archive-name: rec-autos/part4\n\n[this article...,7,rec.autos,"[Archive-name:, rec-autos/part4, [this, articl..."
6176,W >>will NOT do work on internal engine compon...,7,rec.autos,"[W, >>will, NOT, do, work, on, internal, engin..."
...,...,...,...,...
10861,"Archive-name: ripem/faq\nLast-update: Sun, 7 M...",11,sci.crypt,"[Archive-name:, ripem/faq, Last-update:, Sun,,..."
10897,\nTry reading between the lines David - there ...,11,sci.crypt,"[, Try, reading, between, the, lines, David, -..."
10913,\n So in a few years there could be millions o...,11,sci.crypt,"[, So, in, a, few, years, there, could, be, mi..."
11074,\nI think your experiences under the Bulgarian...,11,sci.crypt,"[, I, think, your, experiences, under, the, Bu..."


# Text Features

Text is slightly different from the usual categorical features.
Let's assume that we have the text "I like to code.". This is a string representation. However, machine learning models only understand numeric input vectors. There are different ways to transform a string into a vector:

1. Bag of Words: A baseline, but very effective. Several variants exist: count-based, binary, and TF-IDF.

2. Word Embeddings: More advanced (e.g., Word2Vec)

3. Contextual Embeddings: Highly advanced. Useful for natural language understanding. (e.g., GPT-2, BART, BERT)

In this tutorial we will use the Bag of Words approach.
This approach needs a vectorizer: a special function that maps a string to a vector. The vector has one column per token in the vocabulary.

Let's assume that we have a simple vectorizer with 10 words.

* Index 0: I
* Index 1: a
* Index 2: code
* Index 3: dit
* Index 4: like
* ...
* Index 9: to

The vector representation of the example "I like to code" is:

* [1,0,1,0,1,0,0,0,0,1]

## Question to the audience: Why?

Let's do this transformation with Python.
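Before turning to scikit-learn, the toy vectorizer described above can be sketched in plain Python. This is only an illustration: the words at indices 5 to 8 are hypothetical placeholders (the list elides them), and the tokenization is deliberately naive.

```python
# Toy vocabulary: indices 0-4 and 9 follow the list above.
# The words at indices 5-8 are hypothetical placeholders, since the list elides them.
vocabulary = ["I", "a", "code", "dit", "like", "w5", "w6", "w7", "w8", "to"]
index_of = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    vector = [0] * len(vocabulary)
    # Naive tokenization: replace periods with spaces and split on whitespace.
    for token in text.replace(".", " ").split():
        if token in index_of:
            vector[index_of[token]] += 1
    return vector


example_vector = bag_of_words("I like to code.")
print(example_vector)
```

Each position counts one vocabulary word, so a word that appears twice would yield a 2 at its index, and words outside the vocabulary are simply ignored.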


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# learn the words
vectorizer.fit(X_train['text'])

In [None]:
X_train['text']

1605    Today I recieved a in-warranty replacement for...
3971    To all hardware and firmware gurus:\n\nMy curr...
8357    The  N A T I O N A L  D A Y\n                 ...
5121    Hi,\n\nI just compiled the X11R5 distribution ...
7803    \nConcurrent has a product called RealTimeX (t...
                              ...                        
5430                                             ^^^^^...
3289     \n(Deletion)\n \nSince this drivel is also cr...
9449    /(Frank DeCenso)\n/>\n/>I need to prioritize t...
6074    this must be a FAQ from the very first days of...
3929    i am in need of a motif-based graphing package...
Name: text, Length: 8781, dtype: object

In [None]:
# Let's transform the first document

print(X['text'].iloc[0])

# Only the non-zero entries are displayed
print(vectorizer.transform([X['text'].iloc[0]]))

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
  (0, 6608)	1
  (0, 7487)	1
  (0, 11373)	1
  (0, 12312)	1
  (0, 13162)	2
  (0, 15682)	1
  (0, 16932)	1
  (0, 17537)	1
  (0, 17943)	1
  (0, 18596)	1
  (0, 18685)	1
  (0, 18844)	4
  (0, 22600)	1
  (0, 24144)	1
  (0, 26644)	1
  (0, 26645)	1
  (0, 27539)	1
  (0, 28610)	1
  (0, 28654)	1
  (0, 32316)	2
  (0, 32325)	1
  (0, 32536)	1
  (0, 35639)	1
  (0, 36387)	1
  (0, 37850)	2
  :	:
  (0, 52547)	2
  (0, 52876)	2
  (0, 53146)	1
  (0, 53448)	1
  (0, 53501)	1
  (0, 56034)	1
  (0, 57377)	1
  (0, 59750)	1
  (0, 61020)	1
  (0, 6320

In [None]:
# Let's use a more powerful vectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

X.fillna('', inplace=True)

# Note: fitting on all of X (train and test) lets test-set statistics leak
# into the vocabulary and IDF weights; fit on X_train['text'] for a fair evaluation.
tfidf_vectorizer.fit(X['text'])

print(X['text'].iloc[0])

# Only the non-zero entries are displayed
print(tfidf_vectorizer.transform([X['text'].iloc[0]]))

# Words that are common across documents are penalized!

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
  (0, 100208)	0.04127283119234359
  (0, 99911)	0.08182426960694761
  (0, 97181)	0.11931455263632948
  (0, 96433)	0.07419941576116752
  (0, 96395)	0.10742700604854923
  (0, 96247)	0.06713729167374488
  (0, 95844)	0.20741045940818417
  (0, 89360)	0.029568574947242098
  (0, 88767)	0.1626188405013501
  (0, 88638)	0.05160967738843712
  (0, 88532)	0.16159348472576401
  (0, 88143)	0.2250954292778976
  (0, 84538)	0.1425016492411255
  (0, 84276)	0.1421597187947702
  (0, 83426)	0.0986077056467167
  (0, 81658)	0.12714420529700396
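To see why common words are penalized, here is a minimal plain-Python sketch of a TF-IDF weight. It assumes a hypothetical three-document corpus and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1; it skips the vector normalization that `TfidfVectorizer` applies by default, so the numbers are not directly comparable with the output above.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
docs = [
    "the car is fast",
    "the car is red",
    "the bumper is separate",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[word])) + 1


def tfidf(tokens):
    """Term count multiplied by the inverse document frequency."""
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}


weights = tfidf(tokenized[2])
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that occur in every document ("the", "is") receive the lowest possible weight, while words unique to one document ("bumper", "separate") are weighted higher.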

# Text preprocessing is very important!

Have a look: https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing
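As a rough idea of what such preprocessing can look like, here is a minimal sketch using only the standard library. The stop-word list is a tiny hypothetical one chosen for illustration; a real project would more likely rely on NLTK or spaCy.

```python
import re

# A tiny, hypothetical stop-word list; real projects would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "was", "to", "of", "on", "i"}


def preprocess(text):
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = preprocess("It was a 2-door sports car, looked to be from the late 60s.")
print(cleaned)
```

Whether steps such as stop-word removal actually help depends on the task, so it is worth comparing the classifier with and without them.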

In [None]:
# New 2023-2024

# Let's use embeddings from a powerful model such as BERT.
from transformers import BertModel, BertTokenizer

model_name = 'bert-base-uncased'

tokenizer = BertTokenizer.from_pretrained(model_name)
# Load the pretrained model
model = BertModel.from_pretrained(model_name)


In [None]:
import torch

# The first document, truncated to 256 characters
document = news['text'].values.tolist()[0][:256]
input_ids = tokenizer.encode(document, add_special_tokens=True, padding=True, return_tensors='pt')

In [None]:
document

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate'

In [None]:
# What are the ids 101 ([CLS]) and 102 ([SEP])?
input_ids

tensor([[  101,  1045,  2001,  6603,  2065,  3087,  2041,  2045,  2071,  4372,
          7138,  2368,  2033,  2006,  2023,  2482,  1045,  2387,  1996,  2060,
          2154,  1012,  2009,  2001,  1037,  1016,  1011,  2341,  2998,  2482,
          1010,  2246,  2000,  2022,  2013,  1996,  2397, 20341,  1013,  2220,
         17549,  1012,  2009,  2001,  2170,  1037,  5318,  4115,  1012,  1996,
          4303,  2020,  2428,  2235,  1012,  1999,  2804,  1010,  1996,  2392,
         21519,  2001,  3584,   102]])

In [None]:
predictions = model(input_ids)

In [None]:
predictions[0].shape

torch.Size([1, 64, 768])

In [None]:
# Print the [CLS] embedding ([CLS] is the first token, at position 0)
phrase_embedding = predictions[0][0][0]
phrase_embedding

tensor([ 3.2221e-01, -2.1956e-01,  1.6675e-01,  8.0924e-02,  4.4274e-01,
         8.5647e-01,  1.2253e-02,  9.7457e-01,  5.3790e-01, -9.5959e-01,
        -3.6860e-01, -3.4091e-02, -1.7904e-01,  4.9649e-01, -1.0299e-01,
        -1.2634e-01,  8.0184e-01, -1.7032e-01, -2.6140e-01,  7.2003e-01,
         2.9949e-01,  1.6195e-01, -1.0233e+00,  1.4288e+00,  2.4439e-01,
        -2.1980e-01,  1.0202e-01, -2.5375e-01, -1.2900e-02, -5.0867e-01,
        -5.6432e-03, -4.2304e-01, -6.4476e-01,  2.2635e-01, -6.0859e-01,
        -8.2648e-01, -5.9160e-01, -3.0583e-01, -3.0215e-01,  5.0255e-01,
        -3.8436e-01, -3.6690e-01,  3.4409e-01, -4.3831e-01,  3.6966e-01,
         1.0151e-01,  3.3124e-02, -7.2291e-01,  3.7771e-01, -7.2943e-01,
        -1.3372e-02, -8.9029e-02,  5.9237e-01,  4.7091e-01,  3.6514e-01,
         6.0188e-01,  9.3226e-02,  9.9164e-02,  5.9567e-01,  1.7458e-01,
         2.3297e-01, -1.1302e-01,  1.1998e+00, -7.4951e-01,  1.7279e-01,
         4.7927e-01,  1.6640e-02, -6.8788e-01,  6.3

In [None]:
phrase_embedding

tensor([-2.1747e-01, -3.3792e-01, -1.8371e-01,  1.6238e-01, -1.3658e-01,
        -1.1490e-01, -1.3257e-01,  6.2647e-01,  8.0208e-02, -5.5115e-01,
         4.5636e-01, -1.3574e-01,  1.1719e-01,  2.1668e-01, -6.5669e-03,
         1.7977e-01, -5.2017e-01,  3.7854e-01,  2.5887e-01,  1.0308e-01,
        -1.4298e-01, -2.2667e-01,  4.2026e-01,  2.7508e-01, -9.3527e-03,
         3.4152e-01, -7.0173e-02, -6.2335e-02,  6.5632e-02,  1.8052e-01,
        -2.2264e-01,  4.1759e-01, -6.5978e-01, -2.9784e-01,  1.6194e-01,
        -3.0949e-01, -1.4435e-01, -2.7707e-01, -4.3086e-01,  9.5564e-02,
        -6.3353e-01,  2.2282e-01, -2.6085e-01, -8.6850e-03, -4.0609e-01,
        -4.3817e-01, -3.7310e+00,  2.4011e-01, -1.1686e-01, -1.5812e-01,
        -5.7976e-02, -5.6637e-01, -3.6481e-01,  4.1634e-01,  4.7982e-01,
         3.2633e-01, -6.7033e-01, -2.5609e-01,  3.2261e-01, -5.0866e-01,
         8.5982e-01, -3.5230e-01,  1.7569e-01,  1.9935e-01,  2.8095e-01,
        -4.1526e-02,  1.2276e-01, -5.5065e-02, -2.5


![Click Here](https://miro.medium.com/v2/resize:fit:1166/0*qULsuUPV-nMIdwqs.jpg)

If you would like to use embeddings, experiment with:

* Word2Vec
* Larger BERT Models (Encoder)
* DistilBERT (Encoder)
* GPT-2 (Decoder)
* T5 (Encoder-Decoder)

Remember to use GPU Acceleration.

# Classifying Documents

Now the documents are actually N-dimensional vectors. A simple machine learning model such as logistic regression should be able to learn how to separate the categories.

In [None]:
from sklearn.linear_model import LogisticRegression

clf  = LogisticRegression()
X_train.fillna('', inplace=True)

X_train_features = tfidf_vectorizer.transform(X_train['text'])
clf.fit(X_train_features, y_train)

# Which penalty is used here? (The scikit-learn default is 'l2'.) What about an SVM with more complex parameters?
# See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
# Note:
# 1. Check GridSearchCV
# 2. Bayesian Optimization or HyperOpt

# Experiment with

1. Random Forest
2. SVM
3. Naive Bayes

Hint

* https://www.youtube.com/watch?v=3liCbRZPrZA
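One possible way to combine these experiments with `GridSearchCV` is sketched here. The toy corpus, the choice of `MultinomialNB`, and the candidate parameter values are all assumptions made for illustration; in the tutorial you would pass `X_train['text']` and `y_train` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; not part of the 20 Newsgroups data.
texts = [
    "the engine and the brakes of my car",
    "selling my old car with new tires",
    "which car has the best engine",
    "the car dealer lowered the price",
    "the cipher uses a secret key",
    "public key encryption and privacy",
    "the chip encrypts the serial number",
    "breaking the encryption key is hard",
]
labels = ["rec.autos"] * 4 + ["sci.crypt"] * 4

# Vectorizer and classifier are tuned together as one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Candidate values are illustrative, not tuned recommendations.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["how strong is this encryption key"]))
```

Putting the vectorizer inside the pipeline means every cross-validation fold fits its vocabulary on its own training part only, which keeps the validation scores honest.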

In [None]:
# Predict the test set
X_test.fillna('', inplace=True)
X_test_features = tfidf_vectorizer.transform(X_test['text'])
y_pred = clf.predict(X_test_features)

In [None]:
# Is everything OK?
print(list(y_test[0:10]))
print(list(y_pred[0:10]))
# Some mistakes exist

['sci.electronics', 'rec.sport.hockey', 'comp.sys.ibm.pc.hardware', 'sci.electronics', 'misc.forsale', 'rec.autos', 'talk.politics.mideast', 'rec.autos', 'rec.sport.baseball', 'comp.sys.mac.hardware']
['sci.electronics', 'sci.med', 'comp.sys.ibm.pc.hardware', 'sci.electronics', 'misc.forsale', 'sci.electronics', 'talk.politics.mideast', 'rec.autos', 'rec.sport.baseball', 'comp.sys.mac.hardware']


In [None]:
# Let's check our performance:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
# Quite good for a very simple model

                          precision    recall  f1-score   support

             alt.atheism       0.57      0.64      0.60        77
           comp.graphics       0.59      0.70      0.64       115
 comp.os.ms-windows.misc       0.73      0.65      0.69       123
comp.sys.ibm.pc.hardware       0.68      0.67      0.67       125
   comp.sys.mac.hardware       0.72      0.60      0.66       106
          comp.windows.x       0.76      0.79      0.78       111
            misc.forsale       0.58      0.78      0.66       116
               rec.autos       0.80      0.67      0.73       122
         rec.motorcycles       0.65      0.81      0.72       112
      rec.sport.baseball       0.76      0.90      0.82       122
        rec.sport.hockey       0.90      0.87      0.89       107
               sci.crypt       0.90      0.69      0.78       118
         sci.electronics       0.63      0.68      0.65       121
                 sci.med       0.83      0.86      0.84       133
         

# Part 2: Find the Nearest Neighbors

We identify the nearest neighbors using the brute-force method.

**Not a very good idea for large collections of documents.**
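To make the cost concrete, here is a minimal plain-Python sketch of brute-force search with cosine similarity. The three-dimensional vectors are hypothetical; the point is that every query is compared against every stored vector, so the work grows linearly with the collection size.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor(query, collection):
    """Compare the query against every vector: O(N) similarity computations."""
    best_index, best_score = -1, -1.0
    for i, vector in enumerate(collection):
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Hypothetical 3-dimensional document vectors.
collection = [[1, 0, 0], [0, 2, 1], [1, 1, 0]]
index, score = nearest_neighbor([0, 1, 1], collection)
print(index, round(score, 3))
```

With M queries and N stored documents this needs M x N comparisons, which is what the faster approaches later in the tutorial try to avoid.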

In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
# Let's find the neighbor of the first item in the test set


array([[1446, 2980]])

In [None]:
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine').fit(X_train_features)

distances, indices = nbrs.kneighbors(X_test_features)

print("Source Example:")
print(X_test.iloc[5]['text'])

print("Neighbor:")
print(X_train.iloc[indices[5][0]]['text'])

Source Example:


What do photo radar units look like?  Also, what major U.S. cities use it?
Neighbor:


We had those f*****g photo-radar things here in Sweden a while ago.
There was a lot of fuzz about them, and a lot of sabotage too (a spray-can
with touch-up paint can do a lot of good...).

Eventually they had to drop the idea as there were a lot of court-cases
where the owner of the car could prove he didn't drive it at the time
of speeding.

I especially recall a case where it eventually proved to be a car-thief that
had stolen a car and made false plates. He, ofcourse, chose a license number
of a identical car, so the photo seemed correct...

In conclosion: Photo-radar sucks, every way you look at it!


# Let's Make the Computation Faster!

## LSH for Estimating the Nearest Neighbors Faster

1. Cosine = Random Projection
2. Jaccard = Min-Hash

LSH hashes an input document to a bucket. Classical hashing tries to spread items uniformly and avoid collisions. LSH does exactly the opposite: similar objects should end up in the same bucket! Then we brute-force only within the shared bucket.

*Warning:* LSH does not perform well in cases where the real nearest neighbor is very far away.
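The random-projection idea for cosine similarity can be sketched in plain Python as follows. The helper names, the number of hyperplanes, and the example vectors are assumptions for illustration: each random hyperplane contributes one bit, depending on which side of it the vector lies.

```python
import random


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random Gaussian normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) > 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of positions where two signatures disagree."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = make_hyperplanes(n_planes=64, dim=3)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]      # same direction as a, cosine similarity 1
c = [-1.0, -2.0, -3.0]   # opposite direction, cosine similarity -1

sig_a, sig_b, sig_c = (signature(v, planes) for v in (a, b, c))
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))
```

Vectors pointing in the same direction agree on every bit, while opposite vectors disagree on every bit; for vectors in between, the fraction of differing bits grows with the angle between them.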

In [None]:
# A naive approach to perform random projection

from sklearn import random_projection

# Hint: https://www.pinecone.io/learn/locality-sensitive-hashing-random-projection/

# n_components is the number of random hyperplanes
transformer = random_projection.GaussianRandomProjection(n_components=10)

X_new = transformer.fit_transform(X_train_features)

In [None]:
X_new.shape

(8781, 10)

In [None]:
X_new[0]

array([-0.13976648,  0.36125476, -0.21785384,  0.56343834, -0.24569417,
       -0.23880625,  0.05292513,  0.04899369, -0.10283475, -0.51788516])

In [None]:
# Create the fingerprints
X_new = X_new > 0

In [None]:
X_new

array([[False,  True, False, ...,  True, False, False],
       [ True,  True, False, ..., False, False, False],
       [False,  True,  True, ..., False, False,  True],
       ...,
       [ True,  True,  True, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True,  True, False, ...,  True, False,  True]])

In [None]:
# Naive hashing
hashes = X_new.sum(axis=1)

# Proper hash values for every row
better_hash = []
for c in X_new:
  better_hash.append(sum([j*(2**i) for i,j in list(enumerate(reversed(c)))]))

In [None]:
# For MinHash you should use the datasketch library

import numpy as np

better_hash = np.array(better_hash)

X_train['hash'] = better_hash

In [None]:
X_train['hash']

1605    332
3971    888
8357    505
5121    336
7803    138
       ... 
5430    127
3289    504
9449    960
6074    120
3929    861
Name: hash, Length: 8781, dtype: int64

In [None]:
# For every unique hash you should create a dedicated NearestNeighbors instance
bucket_hash = X_train['hash'].unique()[0]  # avoid shadowing the built-in hash()
bucket_mask = (X_train['hash'] == bucket_hash).to_numpy()
bucket = X_train_features[bucket_mask]
nn = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine').fit(bucket)
# At query time, hash the query and search only in the matching bucket

Quick hint:

* https://www.youtube.com/watch?v=dgH0NP8Qxa8
* https://www.youtube.com/watch?v=Arni-zkqMBA
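The bucket lookup described above could be organized roughly as in the following plain-Python sketch. The random vectors stand in for `X_train_features`, and the dimensions, seed, and helper names are hypothetical; the point is that the index is grouped by bucket once, and each query is compared only against its own bucket.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def bucket_id(vector, planes):
    # Interpret the sign bits as a binary number, most significant bit first.
    bits = [sum(h * x for h, x in zip(plane, vector)) > 0 for plane in planes]
    return sum(bit << i for i, bit in enumerate(reversed(bits)))


rng = random.Random(42)
dim, n_planes = 5, 3
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

# Hypothetical "training" vectors standing in for X_train_features.
train = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

# Group the training indices by bucket once, ahead of query time.
buckets = defaultdict(list)
for i, vector in enumerate(train):
    buckets[bucket_id(vector, planes)].append(i)


def query_bucket(query):
    """Brute-force search restricted to the query's own bucket."""
    candidates = buckets.get(bucket_id(query, planes), [])
    if not candidates:
        return None
    return max(candidates, key=lambda i: cosine(query, train[i]))


query = [x * 2.0 for x in train[7]]  # same direction as training vector 7
print(query_bucket(query), len(buckets))
```

With 3 hyperplanes there are at most 8 buckets, so each query touches only a fraction of the data. More hyperplanes give smaller buckets and faster lookups, at the price of missing more true neighbors that fall into a different bucket.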

# Part 3

When are two documents similar?

1. When they share the same words!
2. When they share the same n-grams!
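One simple way to quantify "sharing the same words or n-grams" is the Jaccard similarity between the sets of word n-grams of two documents. The sketch below is a plain-Python illustration with two example sentences; whitespace tokenization and the helper names are assumptions.

```python
def word_ngrams(text, n):
    """Set of word n-grams (as tuples) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


doc_a = "I will go to play football tomorrow"
doc_b = "I will go to a play about football tomorrow"

unigram_score = jaccard(word_ngrams(doc_a, 1), word_ngrams(doc_b, 1))
bigram_score = jaccard(word_ngrams(doc_a, 2), word_ngrams(doc_b, 2))
print(round(unigram_score, 3), round(bigram_score, 3))
```

The two sentences share most of their words but far fewer of their bigrams, which shows that n-grams are more sensitive to word order than single words.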

What happens when paraphrases are used?

1. The TF-IDF vectorizer will fail, because paraphrases share few words.
2. Embeddings are one approach to solving the problem.
3. Latent Semantic Analysis can also help.

What about the semantics that are provided in the document?

1. I will go to a play about football tomorrow.
2. I will go to play football tomorrow.

"Play" has a different meaning in these two sentences.

Solution: Use contextual embeddings (BERT, ALBERT, DistilBERT)

A different approach:

Linguistic features of a document.

# Let's meet spaCy!

1. Provides the entities of a document.
2. Provides the syntax, which can reveal interactions between entities.
3. Provides valuable preprocessing steps.

In [None]:
import spacy

In [None]:
article = '\n'.join(news['text'].values[0:100].tolist())

In [None]:
print(article)

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.
I recently posted an article asking what kind of rates single, male
drivers under 25 yrs old were paying on performance cars. Here's a summary of
the replies I received.
 
 
 
 
-------------------------------------------------------------------------------
 
I'm not under 25 anymore (but is 27 close enough).
 
1992 Dodge Stealth RT/Twin Turbo (300hp model).
No tickets, no accidents, own a house, have taken defensive driving 1,
airbag, abs, security alarm, single.
 
$1500/year  $500 decut. State Farm Insurance (this in

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
# analyze the document
doc = nlp(article)

In [None]:
# check the sentences
sents = doc.sents

for i, sent in enumerate(sents):
  print(sent)
  if i>10:
    break

I was wondering if anyone out there could enlighten me on this car I saw
the other day.
It was a 2-door sports car, looked to be from the late 60s/
early 70s.
It was called a Bricklin.
The doors were really small.
In addition,
the front bumper was separate from the rest of the body.
This is 
all I know.
If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

I recently posted an article asking what kind of rates single, male
drivers under 25 yrs old were paying on performance cars.
Here's a summary of
the replies I received.
 
 
 
 

-------------------------------------------------------------------------------
 
I'm not under 25 anymore (but is 27 close enough).
 

1992 Dodge Stealth RT/Twin Turbo (300hp model).

No tickets, no accidents, own a house, have taken defensive driving 1,
airbag, abs, security alarm, single.
 
$1500/year  $500 decut.


In [None]:
# check the entities
ents = doc.ents

for i, ent in enumerate(ents):
  print(f'Entity: {ent.text}, Type: {ent.label_}')
  if i>100:
    break

Entity: the other day, Type: DATE
Entity: 2, Type: CARDINAL
Entity: the late 60s/
early 70s, Type: DATE
Entity: Bricklin, Type: GPE
Entity: years, Type: DATE
Entity: 25, Type: CARDINAL
Entity: 25, Type: CARDINAL
Entity: 27, Type: CARDINAL
Entity: 1992, Type: DATE
Entity: Dodge Stealth RT/Twin Turbo, Type: ORG
Entity: 1, Type: CARDINAL
Entity: 1500, Type: MONEY
Entity: 500, Type: MONEY
Entity: State Farm Insurance, Type: ORG
Entity: the additional $100, Type: MONEY
Entity: 1,000,000, Type: MONEY
Entity: 300,000, Type: MONEY
Entity: DE, Type: GPE
Entity: 2nd, Type: CARDINAL
Entity: 5%, Type: PERCENT
Entity: September 1992, Type: DATE
Entity: 11 years, Type: DATE
Entity: 2,500, Type: MONEY
Entity: Steve Flynn, Type: PERSON
Entity: University of Delaware, Type: ORG
Entity: 45, Type: CARDINAL
Entity: last year, Type: DATE
Entity: 24, Type: CARDINAL
Entity: 1992, Type: DATE
Entity: Eagle Talon, Type: ORG
Entity: AWD
    , Type: ORG
Entity: Illinois
    Cost:, Type: ORG
Entity: 820/6, Type: M

In [None]:
# Check: https://spacy.io/usage/linguistic-features#pos-tagging