# Steps
- Tokenization
- Preprocessing
- Selective stopword removal
- Decide the vocabulary size
- TF-IDF (Bag of Words)
- Implementation
- Implementation on new data
- Incorporate into Flask
- Incorporate into the webpage

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [46]:
# Dependencies
from pymongo import MongoClient
import pandas as pd
import re

import nltk
import string

from urls_list import * #where all urls and paths are saved


## Read the historic rental data

In [90]:
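from urls_list import *  # where all urls and paths are saved; expected to provide db_connection_string
client = MongoClient(db_connection_string)
# Fetch every historic rental record, leaving out Mongo's internal _id field
records = list(client.ETLInsights["HistoricRental"].find({}, {'_id': 0}))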

In [91]:
DF = pd.DataFrame(records)

In [92]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
DF_NLP = DF[['id', 'description']].copy()

In [93]:
DF_NLP

Unnamed: 0,id,description
0,c_7195819164,1BR / 0Ba furnished apartment/ 1br -Brand new ...
1,c_7195700072,1BR / 0Ba furnished apartment/ 1br -Spectacula...
2,c_7200564066,2BR / 2Ba 1100ft2 available nov 1 cats are OK ...
3,c_7196251096,2BR / 2Ba available nov 15 loft w/d in unit at...
4,c_7195818373,1BR / 0Ba furnished apartment/ 1br -Newly reno...
...,...,...
11208,c_7224082345,4BR / 5Ba dogs are OK - wooof furnished apartm...
11209,c_7224085040,2BR / 1.5Ba dogs are OK - wooof furnished apar...
11210,c_7224083392,1BR / 2Ba furnished apartment no smoking/ 1br ...
11211,c_7224084157,2BR / 1Ba apartment/ 2br -Unit #1 on main floo...


In [94]:
#Convert all special characters to space except '.', lower case
#DF_NLP.loc[:,'clean1'] = DF_NLP['description'].map(lambda x: re.sub(r'[^0-9a-zA-Z.]', ' ', x.lower())).copy()



In [95]:
# Tokenize a document: lowercase it, replace special characters (except '.') with spaces,
# drop punctuation tokens, and stem the remaining tokens
stemmer = nltk.stem.SnowballStemmer('english')  # built once and reused for every document

def tokenize(text):
    text = re.sub(r'[^0-9a-zA-Z.]', ' ', text.lower())
    tokens = []

    for token in nltk.word_tokenize(text):
        if token not in string.punctuation:
            tokens.append(stemmer.stem(token))
    return tokens
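
The cleaning step inside tokenize can be tried on its own. The sketch below is only an illustration: the helper name normalize is hypothetical, a plain whitespace split stands in for nltk.word_tokenize, and stemming is left out. The sample string is the start of one of the descriptions shown above. It shows why '.' is kept: a value such as 1.5Ba survives as a single token.

```python
import re

def normalize(text):
    """Lowercase and replace every character except letters, digits and '.' with a space."""
    return re.sub(r'[^0-9a-zA-Z.]', ' ', text.lower())

sample = "2BR / 1.5Ba dogs are OK - wooof"

# Whitespace split is a simplification; the notebook uses nltk.word_tokenize plus stemming
rough_tokens = normalize(sample).split()
print(rough_tokens)
# ['2br', '1.5ba', 'dogs', 'are', 'ok', 'wooof']
```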

In [96]:
DF_NLP.loc[:, 'Tokens'] = DF_NLP['description'].map(tokenize)



## Frequency Vectors

The simplest vector encoding fills the vector with the frequency of each word in the document. Each document is represented as the multiset of its tokens, and the value at each word's position in the vector is that word's count. The representation can be a straight integer count or a normalized encoding in which each count is divided by the total number of words in the document.

In [89]:
from collections import defaultdict
#Tokenize and vectorize together
def vectorize(doc):
    features = defaultdict(int)
    for token in tokenize(doc):
        features[token] += 1
    return features
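
The vectorize function above produces straight counts. The normalized encoding described earlier could look like the sketch below. It is a minimal illustration, not part of the original pipeline: the function name normalized_frequencies and the sample token list are hypothetical, and the input is assumed to be an already tokenized document such as one produced by tokenize.

```python
from collections import Counter

def normalized_frequencies(tokens):
    """Return each token's count divided by the total number of tokens in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has no frequencies to report
    return {token: count / total for token, count in counts.items()}

# Hypothetical, already tokenized document
tokens = ["2br", "1ba", "apart", "2br"]
weights = normalized_frequencies(tokens)
print(weights)
# {'2br': 0.5, '1ba': 0.25, 'apart': 0.25}
```

Because the weights of a document sum to 1, long and short descriptions become comparable, at the cost of losing the raw counts.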

In [99]:
DF_NLP.loc[:, 'Vectors'] = DF_NLP['description'].map(vectorize)



In [100]:
DF_NLP

Unnamed: 0,id,description,Tokens,Vectors
0,c_7195819164,1BR / 0Ba furnished apartment/ 1br -Brand new ...,"[1br, 0ba, furnish, apart, 1br, brand, new, on...","{'1br': 2, '0ba': 1, 'furnish': 1, 'apart': 1,..."
1,c_7195700072,1BR / 0Ba furnished apartment/ 1br -Spectacula...,"[1br, 0ba, furnish, apart, 1br, spectacular, 8...","{'1br': 2, '0ba': 1, 'furnish': 1, 'apart': 1,..."
2,c_7200564066,2BR / 2Ba 1100ft2 available nov 1 cats are OK ...,"[2br, 2ba, 1100ft2, avail, nov, 1, cat, are, o...","{'2br': 2, '2ba': 1, '1100ft2': 2, 'avail': 2,..."
3,c_7196251096,2BR / 2Ba available nov 15 loft w/d in unit at...,"[2br, 2ba, avail, nov, 15, loft, w, d, in, uni...","{'2br': 2, '2ba': 1, 'avail': 2, 'nov': 2, '15..."
4,c_7195818373,1BR / 0Ba furnished apartment/ 1br -Newly reno...,"[1br, 0ba, furnish, apart, 1br, newli, renov, ...","{'1br': 2, '0ba': 1, 'furnish': 1, 'apart': 2,..."
...,...,...,...,...
11208,c_7224082345,4BR / 5Ba dogs are OK - wooof furnished apartm...,"[4br, 5ba, dog, are, ok, wooof, furnish, apart...","{'4br': 2, '5ba': 1, 'dog': 1, 'are': 1, 'ok':..."
11209,c_7224085040,2BR / 1.5Ba dogs are OK - wooof furnished apar...,"[2br, 1.5ba, dog, are, ok, wooof, furnish, apa...","{'2br': 2, '1.5ba': 1, 'dog': 1, 'are': 1, 'ok..."
11210,c_7224083392,1BR / 2Ba furnished apartment no smoking/ 1br ...,"[1br, 2ba, furnish, apart, no, smoke, 1br, 2, ...","{'1br': 2, '2ba': 1, 'furnish': 1, 'apart': 1,..."
11211,c_7224084157,2BR / 1Ba apartment/ 2br -Unit #1 on main floo...,"[2br, 1ba, apart, 2br, unit, 1, on, main, floo...","{'2br': 2, '1ba': 1, 'apart': 1, 'unit': 3, '1..."


## Scikit-learn

The CountVectorizer transformer from the sklearn.feature_extraction.text module has its own internal tokenization and normalization methods. The fit method of the vectorizer expects an iterable of strings or file objects and builds a vocabulary dictionary from the corpus. When transform is called, the documents are turned into a sparse matrix in which each stored entry is indexed by the row (the document ID) and the token ID from the vocabulary, and its value is the count:

In [101]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(DF_NLP['description'])

CountVectorizer()

In [103]:
vectors = vectorizer.transform(DF_NLP['description'])

In [122]:
#vectors.toarray()

In [112]:
#vectorizer.get_feature_names_out()  # named get_feature_names() before scikit-learn 1.0
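
To make the sparse output concrete, here is a small sketch on a toy corpus. The two sentences and the toy_ variable names are made up for illustration and are not taken from the rental data. The vocabulary_ attribute holds the token-to-column mapping, and converting the matrix to COO format exposes the stored (row, column, count) entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the rental descriptions
toy_corpus = [
    "furnished apartment near park",
    "apartment with park view and park access",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)  # fit and transform in one step

# vocabulary_ maps each token to its column index
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab.items(), key=lambda item: item[1]))

# Walk the stored entries of the sparse matrix: (document row, token id) -> count
coo = toy_matrix.tocoo()
for row, col, count in zip(coo.row, coo.col, coo.data):
    print(f"document {row}, token id {col}: count {count}")
```

Only non-zero counts are stored, which is what keeps the matrix manageable for a corpus of thousands of listings.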

## The Gensim way

Gensim’s frequency encoder is called doc2bow. To use doc2bow, we first create a Gensim Dictionary that maps tokens to indices based on observed order (eliminating the overhead of lexicographic sorting). The dictionary object can be saved to and loaded from disk, and it implements a doc2bow method that accepts a pretokenized document and returns a sparse vector as a list of (id, count) tuples, where id is the token’s id in the dictionary.
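
The idea behind this encoding can be sketched in plain Python. The TinyDictionary class below is a hypothetical stand-in written for illustration; it is not Gensim's implementation, and Gensim's own id assignment may differ in detail. It assigns ids in the order tokens are first seen and returns (id, count) tuples for a pretokenized document, skipping tokens it has never seen.

```python
from collections import Counter

class TinyDictionary:
    """Hypothetical stand-in for a token-to-id dictionary; not Gensim's API."""

    def __init__(self):
        self.token2id = {}

    def add_documents(self, documents):
        # Assign the next free id to each token the first time it is observed
        for tokens in documents:
            for token in tokens:
                if token not in self.token2id:
                    self.token2id[token] = len(self.token2id)

    def doc2bow(self, tokens):
        # Count only known tokens and return sorted (id, count) tuples
        counts = Counter(t for t in tokens if t in self.token2id)
        return sorted((self.token2id[t], n) for t, n in counts.items())

# Hypothetical pretokenized documents
corpus = [["2br", "1ba", "apart", "2br"], ["1br", "apart", "furnish"]]
dictionary = TinyDictionary()
dictionary.add_documents(corpus)

bow = dictionary.doc2bow(corpus[0])
print(bow)
# [(0, 2), (1, 1), (2, 1)]
```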

In [129]:
import gensim