#Assignment 3 - Part 2
#Normalizing the LinkedIn data
##Mofidul I. Jamal
##Web 2.0 Architectures & Algorithms
##Spring 2015

###Normalizing the data
This file will do the following:
* Load the file with the LinkedIn data that was written out in part 1
* Print out this data
* Perform a custom normalization of the job titles in the data by grouping job titles into roles based on keywords
* Write out the grouped job titles
* Print statistics on how many job titles were normalized, including the percentage of titles my grouping method matched and a list of the titles it could not match

###Load the file, print it out
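
The loading cell in this section reads nested fields from each entry, so it may help to see the shape it relies on. The sketch below runs the same extraction logic on a few made-up entries instead of the real file. The entries, ids, company names and the helper name extract_titles are all hypothetical; only the key names ('id', 'positions', 'values', 'company', 'name', 'title') come from the code in this notebook.

```python
# Hypothetical entries that mimic the structure the loading cell expects.
sample_entries = [
    {'id': 'abc123',
     'positions': {'values': [
         {'title': 'Software Engineer', 'company': {'name': 'Example Corp'}}]}},
    # No 'values' key under 'positions': this entry contributes no rows.
    {'id': 'def456', 'positions': {}},
    # Company without a 'name': this position is skipped.
    {'id': 'ghi789',
     'positions': {'values': [
         {'title': 'Consultant', 'company': {}}]}},
]


def extract_titles(entries):
    """Collect one {'title', 'id', 'company'} dict per named position."""
    rows = []
    for entry in entries:
        # .get() with a default replaces the explicit "'values' in ..." test.
        for position in entry['positions'].get('values', []):
            if 'name' in position['company']:
                rows.append({'title': position['title'],
                             'id': entry['id'],
                             'company': position['company']['name']})
    return rows


rows = extract_titles(sample_entries)
print(rows)
```

Only the first sample entry produces a row, because the other two lack either the list of positions or the company name.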

In [4]:
import json
from prettytable import PrettyTable

# Load the file that was written out during part 1
with open('data\\linkedin_connections_data.json') as data_file:
    json_data = json.load(data_file)

pt = PrettyTable(field_names=['Id', 'Company', 'Title'])
titles = []
for entry in json_data:
    # Some connections list no positions at all
    if 'values' in entry['positions']:
        for position in entry['positions']['values']:
            # Skip positions whose company has no name
            if 'name' in position['company']:
                pt.add_row([entry['id'], position['company']['name'], position['title']])
                titles.append({'title': position['title'], 'id': entry['id'], 'company': position['company']['name']})

# Print it out
print(pt)


+------------+--------------------------------------------+--------------------------------------------------------+
|     Id     |                  Company                   |                         Title                          |
+------------+--------------------------------------------+--------------------------------------------------------+
| -3VRISJPnQ |       DataCore Software Corporation        |              Vice President Customer Care              |
| -3VRISJPnQ |             DataCore Software              |    Vice President Emerging Markets LATAM and CANADA    |
| -3VRISJPnQ |                  DataCore                  |               VP Distribution Operations               |
| -3VRISJPnQ |       DataCore Software Corporation        |                     VP Operations                      |
| D3o_4w9sNZ |        Florida Atlantic University         |         Academic Support Services Coordinator          |
| WcbJ87H4OQ |                 mPWR, LLC                  |     

###Find all tokens in job titles
The following block of code splits every job title into tokens (individual words), counts how often each token occurs across all titles, and prints the counts. Titles are split on whitespace, commas are stripped from each token, and tokens containing a forward slash (/) are split again.
This list of tokens was the basis for selecting the keywords used in the grouping algorithm.
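
As a small worked illustration of the splitting rules, the sketch below tokenizes three titles that appear in the tables in this notebook and counts the resulting tokens. The helper name tokenize is an assumption made for this example; it folds the two passes used in the notebook into one function.

```python
from collections import Counter


def tokenize(title):
    """Split a title on whitespace, strip commas, then split on '/'."""
    tokens = []
    for word in title.split():
        word = word.strip(',')
        for piece in word.split('/'):
            piece = piece.strip()
            # Drop empty and single-character pieces.
            if len(piece) > 1:
                tokens.append(piece)
    return tokens


sample_titles = [
    'Software Engineer',
    'Product Development/Customer Care/Renewals/ISR Manager',
    'Staff Software Engineer',
]

counts = Counter()
for sample in sample_titles:
    counts.update(tokenize(sample))

print(counts['Software'])  # number of sample titles containing "Software"
```

The slash-separated title yields seven separate tokens, which is why words such as ISR and Renewals show up on their own in the frequency table.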

In [8]:
from collections import Counter
from operator import itemgetter

tokens_first_pass = []
tokens_second_pass = []
# Split titles into whitespace-separated tokens, stripping commas from each
titles_only = [row['title'] for row in titles]
for title in titles_only:
    tokens_first_pass.extend([t.strip(',') for t in title.split()])
# Further split any token that contains a /, dropping single-character pieces
for token in tokens_first_pass:
    tokens_second_pass.extend([t.strip() for t in token.split('/')
                               if len(t) > 1])

pt = PrettyTable(field_names=['Token', 'Freq'])
pt.align = 'l'
c = Counter(tokens_second_pass)
# List every token longer than two characters, most frequent first
for (token, freq) in sorted(c.items(), key=itemgetter(1), reverse=True):
    if len(token) > 2:
        pt.add_row([token, freq])
print(pt)

+----------------+------+
| Token          | Freq |
+----------------+------+
| Engineer       | 30   |
| Software       | 28   |
| Manager        | 13   |
| Developer      | 11   |
| Test           | 8    |
| Development    | 8    |
| Product        | 6    |
| CEO            | 5    |
| Web            | 4    |
| Senior         | 4    |
| Owner          | 4    |
| Assistant      | 4    |
| and            | 4    |
| Director       | 3    |
| Coordinator    | 3    |
| Founder        | 3    |
| Staff          | 3    |
| Care           | 3    |
| President      | 3    |
| Customer       | 3    |
| Lead           | 2    |
| Freelance      | 2    |
| Mobile         | 2    |
| Project        | 2    |
| Architect      | 2    |
| Operations     | 2    |
| Renewals       | 2    |
| Sr.            | 2    |
| Ruby           | 2    |
| Professor      | 2    |
| App            | 2    |
| Design         | 2    |
| ISR            | 2    |
| Support        | 2    |
| Vice           | 2    |
| Co-Founder

###Group job titles into roles based on keywords
The following block of code iterates through all the job titles and tries to match each one to a job role using a keyword search. The keywords were hand-picked from the token list above. Matching is a case-sensitive substring test, so a title that contains keywords from several groups is listed under each of them. Finally, the results are printed and written to a file called groups_output.txt.
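
To make the matching rule concrete, here is a minimal sketch that applies the same keyword lists to two titles from this notebook. The helper match_groups and its ignore_case flag are assumptions added for illustration and are not part of the notebook's code; the flag shows one possible variation, not the method used here.

```python
# Keyword lists copied from the grouping cell in this notebook.
GROUPINGS = [
    {'name': 'Engineering/Development', 'keywords': ['Engineer', 'Software', 'Development', 'Architect', 'Developer', 'Programmer', 'Engineering']},
    {'name': 'Management', 'keywords': ['Manager', 'Director', 'Coordinator', 'Supervisor', 'Managing', 'Administrative']},
    {'name': 'Leadership', 'keywords': ['CEO', 'COO', 'CTO', 'CFO', 'VP', 'Vice', 'President', 'Owner', 'Founder', 'Co-Founder', 'Chair']},
    {'name': 'Education', 'keywords': ['Professor', 'Graduate', 'Teaching', 'Instructor', 'Academic', 'Faculty', 'Teacher', 'Student']},
    {'name': 'Sales', 'keywords': ['Sales', 'ISR']},
]


def match_groups(title, groupings, ignore_case=False):
    """Return the names of all groups with a keyword that is a substring of title."""
    haystack = title.lower() if ignore_case else title
    matched = []
    for group in groupings:
        keywords = group['keywords']
        if ignore_case:
            # Hypothetical variation: compare everything in lowercase.
            keywords = [k.lower() for k in keywords]
        if any(k in haystack for k in keywords):
            matched.append(group['name'])
    return matched


# One title can fall into several groups.
multi = match_groups('Product Development/Customer Care/Renewals/ISR Manager', GROUPINGS)
# Case-sensitive matching misses a lowercase "student".
strict = match_groups("Master's student", GROUPINGS)
relaxed = match_groups("Master's student", GROUPINGS, ignore_case=True)
print(multi)
```

The lowercase comparison is not a free improvement: short keywords start matching inside longer words (for example, "cto" occurs inside "director"), so matching on whole tokens would probably be a safer refinement.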

In [9]:
#we will normalize the titles by grouping them based on keywords
groupings = [
    {'name': 'Engineering/Development', 'keywords': ['Engineer', 'Software', 'Development', 'Architect', 'Developer', 'Programmer', 'Engineering']},
    {'name': 'Management', 'keywords': ['Manager', 'Director', 'Coordinator', 'Supervisor', 'Managing', 'Administrative']},
    {'name': 'Leadership', 'keywords': ['CEO', 'COO', 'CTO', 'CFO', 'VP', 'Vice', 'President', 'Owner', 'Founder', 'Co-Founder', 'Chair']},
    {'name': 'Education', 'keywords': ['Professor', 'Graduate', 'Teaching', 'Instructor', 'Academic', 'Faculty', 'Teacher', 'Student']},
    {'name': 'Sales', 'keywords': ['Sales', 'ISR']}
    ]

group_table = PrettyTable(field_names=['Group', 'Title', 'Company'])
matched_titles = []
for group in groupings:
    group_name = group['name']
    group_keywords = group['keywords']
    for row in titles:
        if any(keyword in row['title'] for keyword in group_keywords):
            group_table.add_row([group_name, row['title'], row['company']])
            matched_titles.append((row['title'], row['company']))
print(group_table)

# Write the pretty table output to a file
table_txt = group_table.get_string()
with open('groups_output.txt', 'w') as out_file:
    out_file.write(table_txt)

+-------------------------+--------------------------------------------------------+-------------------------------------------+
|          Group          |                         Title                          |                  Company                  |
+-------------------------+--------------------------------------------------------+-------------------------------------------+
| Engineering/Development |                   Software Engineer                    |                   Citrix                  |
| Engineering/Development | Product Development/Customer Care/Renewals/ISR Manager |       DataCore Software Corporation       |
| Engineering/Development |                  Cognitive Architect                   |              MindSoft Bioware             |
| Engineering/Development |              Product Development Manager               |                   Citrix                  |
| Engineering/Development |                Staff Software Engineer                 |             

###Determine what wasn't matched
The following block of code determines which job titles the grouping algorithm failed to match. It prints those titles, then reports how many unique (title, company) pairs were matched and what percentage of the original data that represents.
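
The set arithmetic behind this step can be sketched on a handful of (title, company) pairs taken from the tables in this notebook. Which pairs count as matched is assumed here purely for illustration, and the variable names are hypothetical.

```python
# A small illustrative subset of (title, company) pairs.
all_pairs = set([
    ('Software Engineer', 'Citrix'),
    ('VP Operations', 'DataCore Software Corporation'),
    ('Web Design', 'Wunderbar Consulting and Sales'),
    ("Master's student", 'FAU'),
])

# A pair can appear more than once in the matched list when its title
# hits several groups; converting to a set removes the duplicates.
matched_pairs = [
    ('Software Engineer', 'Citrix'),
    ('VP Operations', 'DataCore Software Corporation'),
    ('Software Engineer', 'Citrix'),
]
matched_set = set(matched_pairs)

# Everything in the original data that never matched any group.
missed = all_pairs - matched_set

# Percentage of unique pairs that matched at least one group.
percent_matched = 100.0 * len(matched_set) / len(all_pairs)
print("%.2f%% matched" % percent_matched)
```

Because both sides are sets, the percentage counts each unique (title, company) pair once, regardless of how many groups it matched or how many connections share it.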

In [10]:
# Find the unmatched titles with a set difference:
# convert the matched (title, company) pairs into a set,
# build the same kind of set from the original titles,
# and subtract the first from the second.
matched_titles_set = set(matched_titles)
titles_set = set((row['title'], row['company']) for row in titles)
missed_set = titles_set - matched_titles_set
print("Non matched titles:")
non_match_table = PrettyTable(field_names=['Title', 'Company'])
for (title, company) in missed_set:
    non_match_table.add_row([title, company])
print(non_match_table)
print("""Grouping was able to match %d titles out of a total of %d titles.
This means that %.2f%% of all titles were matched with the
grouping algorithm as presented here.""" % (len(matched_titles_set), len(titles_set), (float(len(matched_titles_set)) / len(titles_set)) * 100.0))

Non matched titles:
+--------------------------------------------------+--------------------------------------------+
|                      Title                       |                  Company                   |
+--------------------------------------------------+--------------------------------------------+
| Photographer/Digital Tech/Freelance Photographer |              Apertura64, Inc.              |
|                 Magento Monster                  |                 RubyBlanc                  |
|               Marketing Assistant                |          NAI/Merin Hunter Codman           |
|               Benefits Specialist                |             DataCore Software              |
|                File Administrator                |   Comiter, Singer, Baseman, & Braun, LLP   |
|                 Master's student                 |                    FAU                     |
|                    Web Design                    |       Wunderbar Consulting and Sales       |
