### Analyzer Script

This script is part of the broader breakdown of the [Bias](https://github.com/SarthakJShetty/Bias) project. It cleans up the ```abstract_word_list()``` pre-processing and is a stripped-down version of the [Scraper.py](https://github.com/SarthakJShetty/Bias/blob/master/Scraper.py) code.

#### 1.0 <u>Code:</u>

Here are the libraries used across the code. ```numpy``` generates the index array for the ```pandas``` DataFrame. The ```status_logger()``` function has also been moved here from the ```common_functions``` script.

In [1]:
'''Importing OS here to split the filename at the extension'''
import os
'''Importing the collections which contains the Counter function'''
from collections import Counter
'''Importing pandas here to build the dataframe'''
import pandas as pd
'''Importing datetime library to generate the log files by the status_logger() function'''
from datetime import datetime
'''Importing numpy here to build the index of the pandas dataframe'''
import numpy as np

Defining the required variables here, including ```abstracts_log_name```, the full path to the abstracts file, and ```status_logger_name```, which is derived from the log folder in that path.

In [2]:
abstracts_log_name = "/home/sarthak/projects/Bias/LOGS/LOG_2019-02-27_15_23_Eastern_Himalayas/Abstract_Database_2019-02-27_15_23"
log_name = abstracts_log_name.split('/')
status_logger_name = log_name[6]+"_"+'Status_Logger'
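
The hard-coded index ```log_name[6]``` only picks out the log folder when the path has exactly this depth. As a minimal sketch, assuming the log folder is always the parent directory of the abstracts file, the same name can be derived with ```os.path```. The helper ```log_folder_name()``` is hypothetical and not part of the original script.

```python
import os


def log_folder_name(abstracts_log_name):
    """Return the name of the directory that holds the abstracts file.

    Hypothetical helper, not part of the original script.
    """
    return os.path.basename(os.path.dirname(abstracts_log_name))


abstracts_log_name = "/home/sarthak/projects/Bias/LOGS/LOG_2019-02-27_15_23_Eastern_Himalayas/Abstract_Database_2019-02-27_15_23"
status_logger_name = log_folder_name(abstracts_log_name) + "_Status_Logger"
print(status_logger_name)
```

For the path used in this notebook the result matches what the index-based version produces, and it keeps working if the project is moved to a directory at a different depth.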

Defining the ```status_logger()``` function here, which prints and logs the status keys of the different functions to help with troubleshooting when a run of the script fails.

In [3]:
def status_logger(status_logger_name, status_key):
	'''Status logger to print and log details throughout the running of the program.
	Reading the clock once so that current_hour, current_minute & current_second come from the same instant.'''
	now = datetime.now()
	current_hour = str(now.hour)
	current_minute = str(now.minute)
	current_second = str(now.second)

	'''Logging the complete_status_key and printing the complete_status_key'''
	complete_status_key = "[INFO]"+current_hour+":"+current_minute+":"+current_second+" "+status_key
	print(complete_status_key)
	'''Appending to the log file; the with block closes the file even if the write fails'''
	with open(status_logger_name+'.txt', 'a') as status_log:
		status_log.write(complete_status_key+"\n")
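
The logger writes each time field without zero padding, which is why the run output at the end of this notebook shows entries such as ```10:33:5```. A minimal sketch of a padded alternative using ```strftime``` follows. The helper ```format_status_key()``` is hypothetical, and the date passed in the example is arbitrary; only the time fields matter.

```python
from datetime import datetime


def format_status_key(status_key, now=None):
    """Build a log line with a zero-padded HH:MM:SS timestamp.

    Hypothetical helper; the original status_logger() does not pad its fields.
    """
    if now is None:
        now = datetime.now()
    return "[INFO]" + now.strftime("%H:%M:%S") + " " + status_key


# The date here is arbitrary and chosen only to make the example reproducible.
example = format_status_key("Entered the Analyzer.py code.", datetime(2019, 2, 28, 10, 33, 5))
print(example)  # [INFO]10:33:05 Entered the Analyzer.py code.
```

Fixed-width timestamps sort correctly as text, which helps when several log files are merged or searched.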

Functions for the script are defined here, including the ```analyzer_pre_processing()``` function, which builds the names of the ```.txt``` file that is read and the ```.csv``` file that holds the word frequencies.

In [4]:
def analyzer_pre_processing(abstracts_log_name, status_logger_name):
	'''Carries out the pre-processing tasks, such as building the names of the .txt and .csv files'''
	analyzer_pre_processing_status_key="Carrying out pre-processing functions for analyzer"
	status_logger(status_logger_name, analyzer_pre_processing_status_key)

	'''The .csv file is named after the log folder holding the abstracts, read from the argument
	instead of the global log_name. The .txt file name is the abstracts_log_name plus a suffix'''
	log_folder_name = os.path.basename(os.path.dirname(abstracts_log_name))
	abstracts_csv_file_name = log_folder_name+"_"+"FREQUENCY_CSV_DATA"+".csv"
	abstracts_txt_file_name = abstracts_log_name+"_"+"ANALYTICAL"+".txt"

	analyzer_pre_processing_status_key = "Carried out pre-processing functions for analyzer"
	status_logger(status_logger_name, analyzer_pre_processing_status_key)

	return abstracts_txt_file_name, abstracts_csv_file_name

The ```list_cleaner()``` function contains a list of words scraped as metadata along with the main data of interest, as well as common stop words. Removing them improves the topic modelling results and prevents cluttering of the topical spheres generated.

In [5]:
def list_cleaner(list_to_be_cleaned, status_logger_name):
	list_cleaner_start_status_key = "Cleaning the list of words generated"
	status_logger(status_logger_name, list_cleaner_start_status_key)
    
	'''This function cleans the list containing the words found in the abstract. It eliminates words found in
	another pre-defined list of words.'''
	words_to_be_eliminated = ["the", "of", "and", "in", "to", "a", "is", "for", "from", "with", "that",	"by", "are", "on", "was", "as", "were", "url:", "abstract:",
	"abstract",  "author:", "title:", "at", "be", "an", "during", "have", "this", "which", "study", "been", "species", "not", "has", "between",
	"using", "its", "also", "these", "this", "used", "over", "can", "within", "into", "all","due", "use", "about", "a", 'it', 'their', "where", "we", "most", "may", "through",
	"though", "like", "or", "further", "e.g.", "along", "any", "those", "had", "toward", "due", "both", "some", "use", "even", "more", "but", "while", "pass", 
	"well", "will", "when", "only", "after", "author", "title", "there", "our", "did", "much", "as", "if", "become", "still", "various", "very", "out",
	"they", "via", "available", "such", "than", "different", "many", "areas", "no", "one", "two", "small", "first", "other", "such", "-", "could", "studies", "high",
	"provide", "among", "highly", "no", "case", "across", "given", "need", "would", "under", "found", "low", "values", "xe2\\x80\\x89", "xa", "xc", "xb", "\xc2\xa0C\xc2\xa0ha\xe2\x88\x921", "suggest", "up", "'The", "area"] 
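	'''Converting the list to a set so that each membership check does not scan the whole list'''
	words_to_be_eliminated = set(words_to_be_eliminated)
	cleaned_list_of_words_in_abstract = [item for item in list_to_be_cleaned if item not in words_to_be_eliminated]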

	list_cleaner_end_status_key = "Cleaned the list of words generated"
	status_logger(status_logger_name, list_cleaner_end_status_key)

	return cleaned_list_of_words_in_abstract

def transfer_function(abstracts_txt_file_name, abstracts_csv_file_name, status_logger_name):
	'''This function is involved in the actual transfer of data from the .txt file to the .csv file'''
	transfer_function_status_key = "Copying data from"+" "+str(abstracts_txt_file_name)+" "+"to"+" "+"pandas dataframe"
	status_logger(status_logger_name, transfer_function_status_key)

	'''This list will contain all the words extracted from the .txt abstract file'''
	list_of_words_in_abstract=[]

	'''Each word is appended to the list, from the .txt file'''
	with open(abstracts_txt_file_name, 'r') as abstracts_txt_data:
		for line in abstracts_txt_data:
			for word in line.split():
				list_of_words_in_abstract.append(word)

	'''This function cleans up the data of unnecessary words'''
	cleaned_list_of_words_in_abstract = list_cleaner(list_of_words_in_abstract, status_logger_name)

	'''A Counter is a dictionary, where the value is the frequency of term, which is the key'''
	dictionary_of_abstract_list = Counter(cleaned_list_of_words_in_abstract)

	length_of_abstract_list = len(dictionary_of_abstract_list)

	'''Building a dataframe to hold the data from the dictionary, which in turn contains the data from the .txt file'''
	dataframe_of_abstract_words=pd.DataFrame(index=np.arange(0, length_of_abstract_list), columns=['Words', 'Frequency'])

	'''A counter to keep track of the number of rows added to the dataframe'''
	dictionary_counter = 0

	'''Copying elements from the dictionary to the pandas dataframe. The loop visits each key exactly once,
	so dictionary_counter never goes past the last row index'''
	for dictionary_element in dictionary_of_abstract_list:
		dataframe_of_abstract_words.loc[dictionary_counter, 'Words'] = dictionary_element
		dataframe_of_abstract_words.loc[dictionary_counter, 'Frequency'] = dictionary_of_abstract_list[dictionary_element]
		dictionary_counter = dictionary_counter+1

	transfer_function_status_key = "Copied data from"+" "+str(abstracts_txt_file_name)+" "+"to"+" "+"pandas dataframe"
	status_logger(status_logger_name, transfer_function_status_key)

	transfer_function_status_key = "Copying data from pandas dataframe to"+" "+str(abstracts_csv_file_name)
	status_logger(status_logger_name, transfer_function_status_key)

	'''Saving dataframe to csv file, without the index column'''
	dataframe_of_abstract_words.to_csv(abstracts_csv_file_name, index=False)

	transfer_function_status_key = "Copied data from pandas dataframe to"+" "+str(abstracts_csv_file_name)
	status_logger(status_logger_name, transfer_function_status_key)
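
In the run output at the end of this notebook, roughly fourteen minutes pass between "Cleaned the list of words generated" and "Copied data ... to pandas dataframe", which suggests that the element-by-element ```.loc``` assignment is the likely bottleneck. A minimal sketch of building the same two columns in one call follows. The helper ```counter_to_dataframe()``` and the sample words are hypothetical, and unlike the loop this version orders rows by descending frequency.

```python
from collections import Counter

import pandas as pd


def counter_to_dataframe(word_counts):
    """Build a Words/Frequency frame from a Counter in a single call.

    Hypothetical helper, not part of the original script.
    """
    # most_common() returns (word, count) pairs sorted by descending count.
    return pd.DataFrame(word_counts.most_common(), columns=["Words", "Frequency"])


# Illustrative tokens only; the real input is the cleaned list of abstract words.
word_counts = Counter(["forest", "birds", "forest", "elevation", "forest", "birds"])
frame = counter_to_dataframe(word_counts)
print(frame)
```

The resulting frame can be written out with the same ```to_csv(abstracts_csv_file_name, index=False)``` call used above.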

Defining the ```analyzer_main()``` function responsible for executing all functions of the Analyzer script.

In [6]:
def analyzer_main(abstracts_log_name, status_logger_name):
	'''The analyzer_main function is the entry point that is integrated into the Bias.py code'''
	analyzer_main_status_key="Entered the Analyzer.py code."
	status_logger(status_logger_name, analyzer_main_status_key)

	'''Calling the pre-processing and transfer functions here'''
	abstracts_txt_file_name, abstracts_csv_file_name = analyzer_pre_processing(abstracts_log_name, status_logger_name)
	transfer_function(abstracts_txt_file_name, abstracts_csv_file_name, status_logger_name)
    
	'''Logs the end of the Analyzer code in the status_logger'''
	analyzer_main_status_key="Exiting the Analyzer.py code."
	status_logger(status_logger_name, analyzer_main_status_key)

Calling the ```analyzer_main()``` function here to execute the script and generate the cleaned abstracts data.

In [7]:
analyzer_main(abstracts_log_name, status_logger_name)

[INFO]10:33:5 Entered the Analyzer.py code.
[INFO]10:33:5 Carrying out pre-processing functions for analyzer
[INFO]10:33:5 Carried out pre-processing functions for analyzer
[INFO]10:33:5 Copying data from /home/sarthak/projects/Bias/LOGS/LOG_2019-02-27_15_23_Eastern_Himalayas/Abstract_Database_2019-02-27_15_23_ANALYTICAL.txt to pandas dataframe
[INFO]10:33:5 Cleaning the list of words generated
[INFO]10:33:7 Cleaned the list of words generated
[INFO]10:47:22 Copied data from /home/sarthak/projects/Bias/LOGS/LOG_2019-02-27_15_23_Eastern_Himalayas/Abstract_Database_2019-02-27_15_23_ANALYTICAL.txt to pandas dataframe
[INFO]10:47:22 Copying data from pandas dataframe to LOG_2019-02-27_15_23_Eastern_Himalayas_FREQUENCY_CSV_DATA.csv
[INFO]10:47:22 Copied data from pandas dataframe to LOG_2019-02-27_15_23_Eastern_Himalayas_FREQUENCY_CSV_DATA.csv
[INFO]10:47:22 Exiting the Analyzer.py code.
