# Labour Market Intelligence Tool

## 4.1.1.2 Data Pre-processor

### Module Objective:

This module will handle pre-processing of data in order to get data into the right form for analysis. The following operations will be performed:

1. Load data downloaded by the Scraping Module.
2. If more data exist, then load as well and merge.
3. Clean the data.
4. Integrate missing data
5. De-duplicate data

### Pre-requisite: 

1. A thorough understanding of the data.
2. Data has been scraped from job portals.

### Input:
Scraped Job Ads in either JSON data format or in a MongoDB Collection.

### Output:
Cleaned data in JSON format

### Import Libraries:

In [1]:
import tkinter as tk # Tkinter is used so that file import/export is simplified and chances of errors minimized. 
from tkinter.filedialog import askopenfilename

import pandas as pd
from pandas import DataFrame

import time

In [38]:
def data_loader():
    
    def greet():
        welcome_msg = ('Welcome to the Data Pre-processor of the LMI System. '+
                       'We will begin by loading the data sets into memory.')        
        print(welcome_msg)
    
    
    greet() 
    time.sleep(0.20) # Jupyter Notebook Print-vs-Input Workaround. 
                 # This will prevent the input statement below from executing before the print. 
                # https://stackoverflow.com/questions/50439035/jupyter-notebook-input-line-executed-before-print-statement
    
    
    main_menu_msg = ('\nPlease select an option [1, 2 or 0 to exit]: '+
                     '\n [1] To Import JSON data format.'+                     
                     '\n [2] To Import from MongoDB.'+
                     '\n [0] To Exit.'+
                     '\n ')        
    try:
        main_menu_option = int(input(main_menu_msg)) 
        if (main_menu_option==1):
            
            # Define a function-L1
            def json_file_handler():
                msg = '\nLoad JSON Documents.'
                print(msg)
                time.sleep(0.20) # Jupyter Notebook Print-vs-Input Workaround. 
                    # This will prevent the input statement below from executing before the print. 
                # https://stackoverflow.com/questions/50439035/jupyter-notebook-input-line-executed-before-print-statement

                # A sub function-L2
                def json_file_path():
                    input('Press Enter to browse to the file location.') # Any key works.Pause to ensure user is in control.                                                                 
                    root = tk.Tk()                             
                    root.withdraw() # Hides the root frame. Else, a frame will be hanging around & closing it will cause a crash
                    root.update()  # may be necessary to force the hiding of the main frame.
                    file_name = askopenfilename()             
                    root.destroy()  
                    return file_name
                
                json_file_name = json_file_path()
                print('The selected file is: ' + json_file_name)     


                json_structure_msg =('\nPlease select an option [1 or 2]'+
                    ' to indicate how JSON Objects were organized in the file:\n'+
                    '  [1] An array of JSON objects seperated by a comma \',\' as shown below:\n'+
                    '    [{"Job_Id":"15","Location":"Aberdeen"}, {"Job_Id":"67","Location":"Dundee"}] \n\n'+
                    '  [2] One JSON object per line usually when JSON was exported from MongoDB as shown below:\n'+
                    '    NB- Anything within {} is treated as a single line even if the text spans multiple lines.\n'+
                    '    {"Job_Id":"11","Location":"Cork"}\n'+
                    '    {"Job_Id":"30","Location":"England"}\n')

                json_structure_option = int(input(json_structure_msg))
                if (json_structure_option==1):
                    df = pd.read_json(path_or_buf=json_file_name, lines=False) 
                    print (json_file_name)
                    print(df.info()) 

                elif(json_structure_option==2):
                    df = pd.read_json(path_or_buf=json_file_name, lines=True)    
                    print (json_file_name)
                    print(df.info()) 

                else:
                    print('Invalid Input')              
                                          
                    
            def merge_data():                    
                    merge_data_option=1
                    while (merge_data_option==1):                        
                        merge_data_msg = ('\nWould you like to add i.e. merge another data to the loaded data.'+
                               '\nPlease select an option [1 or 2]\n'+
                               '  [1] Yes to add \n'+
                               '  [2] No to proceed with data preparation.')
                        merge_data_option = int(input(merge_data_msg))
                        if (merge_data_option==1):
                            currentpath = json_file_path()
                            print (currentpath)

                        elif(json_structure_option==2):
                            df = pd.read_json(path_or_buf=json_file_name, lines=True)    
                            print (json_file_name)
                            print(df.info()) 

                        else:
                            print('Invalid Input')
            # Use the function
            json_file_handler()
            merge_data()
                
    except ValueError:
            print('Input Error')

In [39]:
data_loader()

Welcome to the Data Pre-processor of the LMI System. We will begin by loading the data sets into memory.

Please select an option [1, 2 or 0 to exit]: 
 [1] To Import JSON data format.
 [2] To Import from MongoDB.
 [0] To Exit.
 1

Load JSON Documents.
Press Enter to browse to the file location.
The selected file is: C:/Users/User/Desktop/Test Files/TestJson1.json

Please select an option [1 or 2] to indicate how JSON Objects were organized in the file:
  [1] An array of JSON objects seperated by a comma ',' as shown below:
    [{"Job_Id":"15","Location":"Aberdeen"}, {"Job_Id":"67","Location":"Dundee"}] 

  [2] One JSON object per line usually when JSON was exported from MongoDB as shown below:
    NB- Anything within {} is treated as a single line even if the text spans multiple lines.
    {"Job_Id":"11","Location":"Cork"}
    {"Job_Id":"30","Location":"England"}
2
C:/Users/User/Desktop/Test Files/TestJson1.json
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10101 entries, 0 to 101

NameError: name 'json_file_path' is not defined