Version 0.1: changed tfidfkeywordssuggest.py following a change in the googlesearch library. Minor bug fixed.
Anakeyn TF-IDF Keywords Suggest is a keyword suggestion tool for SEO and Web Marketing purposes. This tool searches for and stores the first x pages returned by Google for a given keyword.
Next, the system retrieves the content of the pages in order to find popular and original keywords/expressions in the subject area. The system works with a TF-IDF algorithm.
TF-IDF means term frequency–inverse document frequency. TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
To calculate a "global" TF-IDF value, we compute two averages for each term: the mean of its TF-IDF over all documents, used to find popular expressions, and the mean of its non-zero TF-IDF values only (that is, over the documents that actually contain the term), used to find original expressions.
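The two averages described above can be sketched as follows. This is only an illustration in pure Python: the tokenization (lowercase, split on whitespace) and the plain tf * log(N / df) weighting are assumptions, and the function names are hypothetical. The actual program may use a different TF-IDF variant.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (terms, rows) where rows[d][j] is the TF-IDF of terms[j] in document d."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        rows.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in terms
        ])
    return terms, rows


def global_scores(terms, rows):
    """Mean over all documents ("popular") and mean over non-zero values ("original")."""
    popular, original = {}, {}
    for j, term in enumerate(terms):
        column = [row[j] for row in rows]
        nonzero = [v for v in column if v > 0]
        popular[term] = sum(column) / len(column)
        original[term] = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return popular, original


# Example with three tiny made-up "pages".
docs = ["seo tools for seo", "keyword research tools", "rare longtail phrase"]
terms, rows = tfidf_matrix(docs)
popular, original = global_scores(terms, rows)
for term in sorted(terms, key=original.get, reverse=True)[:3]:
    print(term, round(popular[term], 4), round(original[term], 4))
```

A term that appears in a single document keeps its full score in the non-zero mean, while its mean over all documents is divided by the number of documents. This is why the second average highlights rarer, more "original" expressions.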
The program is developed in Python as a web application using Flask (web framework), Jinja2 (web template engine), SQLAlchemy (object-relational mapping for SQL databases), Bootstrap (front-end framework) ...
    KeywordsSuggest
    |   database.db
    |   favicon.ico
    |   tfidfkeywordssuggest.py
    |   license.txt
    |   myconfig.py
    |   requirements.txt
    |   __init__.py
    |
    +---configdata
    |       tldLang.xlsx
    |       user_agents-taglang.txt
    |
    +---static
    |       Anakeyn_Rectangle.jpg
    |       tfidfkeywordssuggest.css
    |       Oeil_Anakeyn.jpg
    |       signin.css
    |       starter-template.css
    |
    +---templates
    |       index.html
    |       tfidfkeywordssuggest.html
    |       login.html
    |       signup.html
    |
    +---uploads
By default the system works with a SQLite database called database.db which is created the first time you use the program. The main program is "tfidfkeywordssuggest.py".
Default config variables are in the myconfig.py file, including the 2 default users: admin (pwd "adminpwd") and guest (pwd "guestpwd").
Other configuration data is available in the configdata subdirectory, in 2 files:
- tldLang.xlsx: parameters for Google top-level domains and search engine results page languages (358 combinations).
- user_agents-taglang.txt: a list of 4281 valid user agents, provided to Google at random to avoid blocking.
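As an illustration of the random user agent selection, here is a minimal sketch. It assumes that user_agents-taglang.txt holds one user agent string per line, which is not confirmed here, and the function names are hypothetical rather than taken from the program.

```python
import random


def load_user_agents(lines):
    """Keep non-empty, stripped lines (assumption: one user agent string per line)."""
    return [line.strip() for line in lines if line.strip()]


def pick_user_agent(user_agents, rng=random):
    """Return one user agent chosen uniformly at random."""
    return rng.choice(user_agents)


# Possible usage with the real file (path relative to the application directory):
# with open("configdata/user_agents-taglang.txt", encoding="utf-8") as fh:
#     agents = load_user_agents(fh)
# headers = {"User-Agent": pick_user_agent(agents)}
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which can help when testing.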
The static directory contains images and .css files.
The templates directory contains .html templates.
The uploads directory is where all keywords files are created and saved for download.
The system creates 7 "popular" keywords/expressions files: 1 file with expressions of all sizes (in words), and one file each for 1-, 2-, 3-, 4-, 5- and 6-word expressions. The same applies to the "original" keywords/expressions files. When available, the system provides a maximum of 10,000 expressions per file. This should be enough to get ideas :-)
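The grouping into 7 files can be sketched as below. This is a hypothetical helper, not the program's own code: it assumes the expressions are already sorted by score, and it returns 7 lists (key 0 for all sizes, keys 1 to 6 for n-word expressions) instead of writing files.

```python
MAX_EXPRESSIONS = 10_000  # cap per file, as stated above
MAX_WORDS = 6             # longest expression size, in words


def split_by_word_count(ranked_expressions, limit=MAX_EXPRESSIONS):
    """Group ranked expressions into 7 buckets: 0 = all sizes, 1..6 = n-word expressions."""
    buckets = {n: [] for n in range(0, MAX_WORDS + 1)}
    for expression in ranked_expressions:
        n_words = len(expression.split())
        if not 1 <= n_words <= MAX_WORDS:
            continue  # ignore empty strings and expressions longer than 6 words
        for key in (0, n_words):
            if len(buckets[key]) < limit:
                buckets[key].append(expression)
    return buckets


ranked = ["seo", "seo tools", "best seo tools", "seo audit"]
buckets = split_by_word_count(ranked)
print(buckets[0])  # all sizes
print(buckets[2])  # 2-word expressions only
```

Running the same function once on the "popular" ranking and once on the "original" ranking would give the 14 lists in total.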
Download the .zip file of this application (https://github.com/Anakeyn/TFIDFKeywordsSuggest/archive/master.zip) and unzip it in a directory on your computer.
Download and install Anaconda: https://www.anaconda.com/distribution/#download-section
Anaconda will install several tools on your computer, including Anaconda Prompt and Spyder, which are used in the next steps.
Open Anaconda Prompt and go to the directory where you unzipped the application (for example, on Windows: cd c:\Users\myname\document......\ ).
Make sure the file "requirements.txt" is in your directory: dir (Windows) or ls (Linux).
Install the library dependencies for the Python code with the following command:
For Linux : while read requirement; do conda install --yes $requirement || pip install $requirement; done < requirements.txt
For Windows : FOR /F "delims=~" %f in (requirements.txt) DO conda install --yes "%f" || pip install "%f"
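For readers who prefer a single cross-platform script, the same "try conda, fall back to pip" loop can be sketched in Python. This is a hypothetical helper, not part of the application. It assumes that conda and pip are both on the PATH (as in Anaconda Prompt), and unlike the shell commands above it skips blank lines and comment lines.

```python
import subprocess


def install_requirements(path="requirements.txt", run=subprocess.call):
    """Try `conda install --yes <req>` first; fall back to `pip install <req>` if it fails."""
    installed_with = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            requirement = line.strip()
            if not requirement or requirement.startswith("#"):
                continue  # skip blank lines and comments
            if run(["conda", "install", "--yes", requirement]) == 0:
                installed_with[requirement] = "conda"
            elif run(["pip", "install", requirement]) == 0:
                installed_with[requirement] = "pip"
            else:
                installed_with[requirement] = None  # neither installer succeeded
    return installed_with


# Possible usage, from the application directory:
# print(install_requirements("requirements.txt"))
```

The run parameter defaults to subprocess.call, which returns the exit code of the command. It can be replaced by a stub to check the logic without installing anything.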
Next, launch Spyder and open the main Python file tfidfkeywordssuggest.py.
Make sure that you are in the correct directory, then click on the green arrow to run the Python file.
Next, open a browser and go to the address http://127.0.0.1:5000
Click on "Keywords Suggest". The system is protected: provide the default admin credentials (admin, adminpwd) or the default guest credentials (guest, guestpwd).
Next, choose an expression and a target country/language.
The system searches Google for pages responding to the keyword, saves the pages, retrieves their content and calculates a TF-IDF value for each term found in the pages. It then provides 14 files with up to 10,000 popular or original expressions each.
As you can see, not all languages are filtered by Google (see the "lr" parameter for the list: https://developers.google.com/custom-search/docs/xml_results_appendices#lrsp). However, with the country filter and the language specified in the user agent, the results are often usable.
Here are the results for original 2-word expressions for "SEO" in Swahili in the Democratic Republic of the Congo.