## HashTag Creator
By Ivelize

#### Problem definition: given a set of txt files, find the "most relevant" words and the sentences in which they are used.

Words are ranked by their TF-IDF score, which is computed with the scikit-learn library.
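
To make the ranking criterion concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF and L2 normalization that scikit-learn uses by default, and it simplifies tokenization to a lowercase whitespace split with no stop-word removal. The function name `tfidf` and the toy corpus are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by
    L2 normalization of each document vector.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * (math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()} if norm else raw)
    return weights

# Toy corpus, made up for illustration.
docs = ["work hard", "work together", "country first"]
scores = tfidf(docs)
print(scores[0])
```

In the first toy document, "hard" gets a higher weight than "work" because "work" also appears in another document, which is the behaviour that makes TF-IDF favour distinctive words.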

#### Assumption:
   - the files are in the same directory as the notebook and are encoded in ASCII.

In [113]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import collections
import pandas as pd
import numpy as np
import glob
import re

In [114]:
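# Read every .txt file in the notebook's directory
files_content = list()
dict_setences_per_file = {}
for filename in glob.glob('*.txt'):
    with open(filename) as f:  # the with block closes the file on exit
        doc_text = f.readlines()
        f.seek(0)
        # Split on whitespace that follows '.', '?' or '!', except after
        # abbreviations such as "U.S." or "Dr."
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', f.read())
    # Remove whitespace characters such as `\n` from both ends of each line
    doc_text = [x.strip() for x in doc_text]
    sentences = [x.strip() for x in sentences]
    # Each line of each file becomes one "document" for the vectorizer
    files_content.extend(doc_text)
    # Key the sentences by the file name without its extension
    dict_setences_per_file[filename.split('.')[0]] = sentences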
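
The sentence-splitting pattern is dense, so a small standalone sketch may help show what it does. It reuses the same pattern as the cell above, written as a raw string; the helper name `split_sentences` and the sample text are made up for illustration.

```python
import re

# Split on whitespace preceded by '.', '?' or '!', unless the preceding
# characters look like "U.S." (\w\.\w.) or a two-letter title such as "Dr.".
SENTENCE_BOUNDARY = re.compile(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s')

def split_sentences(text):
    """Return the non-empty, stripped sentences found in `text`."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

# Made-up sample text.
sample = "Dr. Smith arrived. He spoke to the U.S. delegation. Was it useful? Yes!"
for sentence in split_sentences(sample):
    print(sentence)
```

The lookbehinds only cover the two abbreviation shapes named in the comment, so a longer title such as "Mrs." would still be treated as a sentence end.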

In [115]:
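# Tokenize the lines and weight the word counts with TF-IDF;
# min_df=0.1 keeps only terms that occur in at least 10% of the lines
tf = TfidfVectorizer(analyzer='word', stop_words='english', min_df=0.1, lowercase=True)
words_tf = tf.fit_transform(files_content)
# Note: scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2
features_per_doc = tf.get_feature_names()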

In [116]:
# Convert each sparse row of the TF-IDF matrix into a dense score vector
# (each row holds a single entry, so the mean over axis 0 is the row itself)
tfidf_means_docs = []
for w_tf in words_tf:
    tfidf_means = np.mean(w_tf.toarray(), axis=0)
    tfidf_means_docs.append(tfidf_means)

In [117]:
# Indices of the (up to) 100 highest-scoring words in each row, best first
topn_ids = []
feature_qty = 100
for tfidf in tfidf_means_docs:
    topn_ids.append(np.argsort(tfidf)[::-1][:feature_qty])
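
The `argsort` idiom above sorts ascending, reverses the order, and keeps the first `feature_qty` positions. A dependency-free sketch of the same selection is shown below. It is not the notebook's NumPy code, the helper name `top_n_indices` is hypothetical, and the order of tied scores may differ from NumPy's.

```python
def top_n_indices(scores, n):
    """Return the indices of the n largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

row = [0.0, 0.7, 0.2, 0.5]  # made-up TF-IDF scores for a 4-word vocabulary
print(top_n_indices(row, 2))
print(top_n_indices(row, 100))
```

When `n` exceeds the number of columns, every index is returned, so with a vocabulary smaller than 100 words the cut-off does not filter anything.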

In [118]:
# Map the top indices of every row back to the corresponding words
list_top_feats = []
for t_id in topn_ids:
    top_feats = [features_per_doc[i] for i in t_id]
    list_top_feats.extend(top_feats)

In [119]:
# Count how often each top feature occurs across all rows and keep those above the threshold
top_feats_counted = collections.Counter(list_top_feats)
threshold_occurrence_qty = 1
dict_feats = dict((k, v) for k, v in top_feats_counted.items() if v > threshold_occurrence_qty)
print(dict_feats)

{u've': 310, u'just': 310, u'people': 310, u'country': 310, u'work': 310, u'know': 310, u'time': 310, u'america': 310}


In [120]:
# Collect, for every relevant word, the files and sentences that contain it
dict_final = {}
for doc, list_sentences in dict_setences_per_file.items():
    for sentence in list_sentences:
        for word in dict_feats:
            # Case-sensitive substring test, so 've' also matches inside 'have'
            if word in sentence:
                info = dict_final.setdefault(word, {'files': [], 'sentence': []})
                if doc not in info['files']:
                    info['files'].append(doc)
                # Avoid storing the same sentence twice
                if sentence not in info['sentence']:
                    info['sentence'].append(sentence)
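
The `word in sentence` test is a case-sensitive substring check, so a lowercased feature such as 'america' does not match "America", while 've' matches inside "have". One possible alternative, sketched here under the assumption that whole-word, case-insensitive matching is what is wanted, uses a regular expression with word boundaries. The helper name `contains_word` is hypothetical and not part of the notebook.

```python
import re

def contains_word(word, sentence):
    """True if `word` occurs in `sentence` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None

# Made-up sentences for illustration.
print(contains_word("america", "America is strong."))
print(contains_word("ve", "I have seen it."))
print(contains_word("ve", "We've seen it."))
```

Swapping this in for the substring test would change which sentences are collected, so the resulting table would differ from the one printed in this notebook.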

In [121]:
df = pd.DataFrame(dict_final)

In [122]:
df

| | country | just | know | people | time | ve | work |
|---|---|---|---|---|---|---|---|
| files | [doc2, doc3, doc1, doc4] | [doc2, doc3, doc1, doc6, doc4] | [doc2, doc3, doc1, doc6, doc4] | [doc2, doc3, doc1, doc6, doc4] | [doc2, doc3, doc1, doc6, doc4] | [doc2, doc3, doc1, doc6, doc4] | [doc2, doc3, doc1, doc6, doc4] |
| sentence | [It is that promise that has always set this c... | [And when one of his chief advisors - the man ... | [Four years ago, I stood before you and told y... | [Tonight, I say to the American people, to Dem... | [I am grateful to finish this journey with one... | [Let me express my thanks to the historic slat... | [Let me express my thanks to the historic slat... |
