# Techniques for keyword extraction

### TFIDF ### 
Uses TFIDF to find the most relevant words in a document and uses them as keywords. Good for single-word keywords.
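
As a rough illustration of what the score measures, the sketch below computes TFIDF by hand for a toy corpus. It assumes the plain textbook formula (term frequency times `log(N / df)`); scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalises each row by default, so its numbers will differ. The corpus and the function name `tfidf_scores` are made up for the example.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per document (textbook formula, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# hypothetical toy corpus
docs = [
    "farmers use data to grow crops",
    "startups use data to grow revenue",
    "investors fund startups",
]
scores = tfidf_scores(docs)
# two highest-scoring words of the first document
top = sorted(scores[0], key=scores[0].get, reverse=True)[:2]
print(top)
```

In the first document, `farmers` and `crops` score highest because they appear in no other document, while words shared with the second document (`use`, `data`, `to`, `grow`) are down-weighted.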


### RAKE (via the rake-nltk package) ###
Rapid Automatic Keyword Extraction. Good for multi-word phrases. 
- Calculates the frequency of content words. 
- Creates a list of candidate phrases from the content words.
- Creates a co-occurrence matrix that counts how often words appear together in the list of phrases. 
- Scores each word as its degree in the co-occurrence matrix divided by its frequency, then scores each phrase as the sum of its word scores.
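
The steps above can be sketched in plain Python. This is a minimal illustration, not the `rake-nltk` implementation: the stop list is a tiny made-up one, and the function name `rake` and the sample sentence are assumptions for the example.

```python
import re
from collections import defaultdict

# tiny illustrative stop list (an assumption; real implementations use a full list)
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "with"}

def rake(text, stop_words=STOP_WORDS):
    """Return (phrase, score) pairs sorted by descending RAKE score."""
    # split into candidate phrases at punctuation and stop words
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # freq(w): occurrences of w; degree(w): co-occurrences with words in the
    # same phrase, counting w itself (the row sum of the co-occurrence matrix)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # word score = degree / frequency; phrase score = sum of its word scores
    word_score = {w: degree[w] / freq[w] for w in freq}
    phrase_scores = {}
    for phrase in phrases:
        phrase_scores[" ".join(phrase)] = sum(word_score[w] for w in phrase)
    return sorted(phrase_scores.items(), key=lambda item: item[1], reverse=True)

ranked = rake("Rapid automatic keyword extraction is good for keyword phrases in the document.")
print(ranked[:2])
```

Longer phrases tend to win because every word in them gets a high degree, which is why RAKE favours multi-word phrases over single words.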

### Machine/Deep Learning Approach ###
- Extracting skills from CVs with Keras neural network: https://towardsdatascience.com/deep-learning-for-specific-information-extraction-from-unstructured-texts-12c5b9dceada
- Keyword Suggestion Tool with Tensorflow: https://wordlift.io/blog/en/keyword-suggestion-tool-tensorflow/


## Conclusion

Because we want to use the keywords as tags, which are usually single words or at most 2-word phrases, TFIDF is probably the best fit. Most of the work is likely to be NLP / text pre-processing. A deep learning approach is more advanced and, based on my research, has few guides or tutorials (mostly research or graduate-level papers). 




Sources:

1) https://nzmattgrant.wordpress.com/2018/01/31/a-comparison-of-rake-and-tf-idf-algorithms-for-finding-keywords-in-text/
2) https://monkeylearn.com/keyword-extraction/
3) https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
4) https://github.com/kavgan/nlp-in-practice/blob/master/tf-idf/Keyword%20Extraction%20with%20TF-IDF%20and%20SKlearn.ipynb


# Demo of TFIDF keyword extraction
Uses only limited preprocessing and NLP.

In [39]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as tfidf


In [40]:
tech = pd.read_csv("C:/Users/Jenna/Documents/medium-tech-data.csv")

In [41]:
#remove rows with null/empty text, then renumber the rows from 0
tech = tech[pd.notnull(tech['text'])]
tech.reset_index(drop = True, inplace = True)

In [42]:
#Create TFIDF matrix from all article text
vector = tfidf(max_df = 0.85, stop_words = "english", strip_accents = 'ascii', max_features = 10000)
vector.fit_transform(tech.text.tolist())

<4002x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 182523 stored elements in Compressed Sparse Row format>

In [59]:
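#vocabulary terms, ordered by column index of the TFIDF matrix
#(get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vector.get_feature_names_out()
test_article = [tech.text[0]]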

In [60]:
#Get TFIDF for individual article
article_vector = vector.transform(test_article) 
article_coord = article_vector.tocoo()

#sort vector values descending order
tuples = zip(article_coord.col, article_coord.data)
sorted_values = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

#get the top 5 values
sorted_values = sorted_values[:5]

#map each top feature name to its rounded TFIDF score
results = {}
for idx, score in sorted_values:
    results[feature_names[idx]] = round(score, 3)
    
results

{'infarm': 0.341,
 'farmers': 0.274,
 'data': 0.244,
 'crops': 0.208,
 'farming': 0.204}