# Data Wrangling


## Wrong output cases:
- The output is wrong when a person held several positions at one company (Examples: FaikC, OmarD) **`DONE`**
- The output is wrong when one entry is split across two pages (Example: NedimC) **`DONE`**

Both cases are tricky, but I am going to figure them out.

In [1]:
import spacy
import pandas as pd
import re
import en_core_web_sm
import slate3k as slate

In [2]:
def pdf_to_text(name):
    """
    Return the text extracted from a PDF as a list of strings.
    Each element of the list is one page of the PDF document.

    """
    with open("Resumes/LinkedIn/" + name + ".pdf", 'rb') as f:
        extracted_text = slate.PDF(f)
    
    return extracted_text

### Read the PDF and concatenate its pages into one string

In [3]:
nlp = en_core_web_sm.load() #load small spacy model

In [4]:
def concat_pages(filename):
    """Concatenate all pages into one string and return that string"""
    cv = pdf_to_text(filename)
    text = ""
    
    for page in cv:
        text+=page
    
    return text

In [5]:
doc = nlp(concat_pages("FaikC"))



**Here is the text extracted from the PDF file**

In [6]:
text = doc.text

In [7]:
text

"Contact\n\nfaik.catibusic@gmail.com\n\nwww.linkedin.com/in/faik-\ncatibusic-55190335 (LinkedIn)\n\nFaik Catibusic\n\nFounder at Ozon\n\nSummary\n\nTop Skills\n\n.NET\n\nASP.NET MVC\n\nC#\n\nLanguages\n\nEnglish (Full Professional)\n\nBosnian (Native or Bilingual)\n\nCertifications\n\nStress and conflict management\n\nTeamwork and decision making\n\nFunctional Programming Principles\nin Scala by École Polytechnique\nFédérale de Lausanne on Coursera\n\nFaik is a problem solver at heart, passionate about crafting software,\nwho loves to introduce cutting-edge technologies to new and legacy\nprojects alike. He is capable of complex software system design and\nis actively involved in their implementation. Lately, he has focused\non the design and development of micro-services architecture and\ntransformation of monolithic systems into ones based on micro-\nservices.\n\nExperience\n\nOzon\n\nFounder\n\nOctober 2019\xa0-\xa0Present\xa0\n\nAutomated ranking system for job applicants based on 

In [8]:
def clean_string(text):

    '''
    Remove page footers that look like this:
        Page 2 of 3\n\n\xa0\n\xa0\n\xa0\n\x0c
        
    A footer does not contain any valuable information, and it sometimes causes a wrong answer in the output.
    '''
    
    pdf = text
    end = 0
    result = ""
    
    expression = r"Page \d of \d\n\n\xa0\n\xa0\n\xa0\n\x0c" 
    
    for match in re.findall(expression, text): # `findall` returns all matched footers as a list
        word = re.search(match, pdf)  # match object with the start and the end of the footer in `pdf`
        pdf = pdf[word.end():] # `search` only returns the first match, so the next iteration continues after this footer
        result+=text[end:word.start() + end] # copy the text between the previous footer and this one
        end += word.end()
    
    result += text[end:] # keep the text that follows the last footer
    
    return result
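
The same clean-up can probably be written as a single `re.sub` call. The block below is a minimal sketch, assuming every footer follows the pattern shown above; `strip_page_footers` and the sample text are made up for illustration and are not part of the notebook.

```python
import re

# Same footer shape as above; \d+ also accepts page numbers with several digits
FOOTER = r"Page \d+ of \d+\n\n\xa0\n\xa0\n\xa0\n\x0c"

def strip_page_footers(text):
    """Return the text with every page footer removed."""
    return re.sub(FOOTER, "", text)

# Hypothetical two-page document, each page ending with a footer
sample = (
    "Experience\n\nOzon\n\nFounder\n\n"
    "Page 1 of 2\n\n\xa0\n\xa0\n\xa0\n\x0c"
    "Toptal\n\nSenior Software Engineer\n\n"
    "Page 2 of 2\n\n\xa0\n\xa0\n\xa0\n\x0c"
)

cleaned = strip_page_footers(sample)
print(repr(cleaned))
```

Because `re.sub` leaves everything outside the matches untouched, text before the first footer and after the last one is kept.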

In [9]:
text = clean_string(text)

**Here I find the start and the end of every date keyword**

In [10]:
def get_companies(text, expression):
    
    """
    Find the start and the end of every match of a pattern in the text.
    Returns a list of [start, end] pairs.
    """
    keys = []
    pdf = text
    end = 0
    
    for match in re.findall(expression, text):
        word = re.search(expression, pdf) # first match in the part of the text not searched yet
        pdf = pdf[word.end():]
        keys.append([word.start() + end, word.end() + end]) # shift by `end` to get positions in the full text
        end += word.end() 
    
    return keys
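
A shorter way to collect the same positions is `re.finditer`, which yields a match object with `start()` and `end()` for every non-overlapping match. This is only a sketch: `find_spans` and the sample text are assumptions for illustration, and the pattern is the "Present" pattern used in the next cell.

```python
import re

PRESENT = r"\n\n.+\d{1,4}\xa0-\xa0Present\xa0\n"

def find_spans(text, expression):
    """Return [start, end] for every non-overlapping match of expression."""
    return [[m.start(), m.end()] for m in re.finditer(expression, text)]

# Hypothetical extract with two current positions
sample = (
    "Ozon\n\nFounder\n\nOctober 2019\xa0-\xa0Present\xa0\n\n"
    "73lab\n\nFounder\n\nAugust 2019\xa0-\xa0Present\xa0\n\n"
)

spans = find_spans(sample, PRESENT)
# strip() also removes the trailing non-breaking space (\xa0)
dates = [sample[start:end].strip() for start, end in spans]
print(spans)
print(dates)
```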

In [11]:
present_keywords = get_companies(text, r"\n\n.+\d{1,4}\xa0-\xa0Present\xa0\n")

## Get company and title

In [12]:
def get_info(start):
    """
    Return the company and the title that belong to a date line.
    Reads the global `text`: the title is the block right above the date line,
    and the company is the block above the title.
    
    Parameters:
    start (int): Index in `text` where the match of the date line starts
    
    Returns:
    list[str, str]: [Name of the company, Title in that company]
    
    """
    
    string = ""
    for i in reversed(range(start)):
        string+=text[i]
        if("\n\n" in string[::-1]): # walk backwards until the `\n\n` that separates two blocks is reached
            title = string[::-1].strip() # remove surrounding whitespace and newlines
            
            string = ""
            for j in reversed(range(i)):
                string+=text[j]
                if("\n\n" in string[::-1]):         
                    company = string[::-1].strip()
                    return [company, title]
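
The backward scan can also be expressed with `str.rsplit`, which splits from the right. The sketch below assumes, as the function above does, that blocks are separated by `\n\n`; `company_and_title` and the sample text are hypothetical names used only for this example.

```python
def company_and_title(text, start):
    """Return [company, title] taken from the two blocks that end right before text[start]."""
    # Split at most twice from the right: [everything before, company, title]
    blocks = text[:start].rsplit("\n\n", 2)
    if len(blocks) < 3:
        return None  # fewer than two separators above the date line
    return [blocks[-2].strip(), blocks[-1].strip()]

# Hypothetical extract; `start` is where the date line match begins
sample = "Contact\n\nOzon\n\nFounder\n\nOctober 2019\xa0-\xa0Present\xa0\n"
start = sample.index("\n\nOctober")

print(company_and_title(sample, start))
```

Passing `text` as an argument instead of reading a global makes the helper easier to test on small samples.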
                            

In [13]:
companies = []

for key in present_keywords:
    companies.append(get_info(key[0])) #get company and title for every date

In [14]:
present_dates = [text[keyword[0]:keyword[1]].strip() for keyword in present_keywords]

# The list above only fills the "Date" column of the dataframes. It can be deleted if that column is not needed.

df = pd.DataFrame({
    "Company": [row[0] for row in companies],
    "Title": [row[1] for row in companies],
    "Date": present_dates
    
})

In [15]:
df

| | Company | Title | Date |
|---|---|---|---|
| 0 | Ozon | Founder | October 2019 - Present |
| 1 | 73lab | Founder | August 2019 - Present |
| 2 | Apres | Advisor & Co-Founder | March 2018 - Present |
| 3 | Sarajevo School of Science and Technology | Lecturer | 2016 - Present |
| 4 | Toptal | Senior Software Engineer | April 2015 - Present |


## Get past companies

In [16]:
def get_past_companies(text):
    """
    Find the start and the end of every finished date range, for example
    "June 2011 - August 2013 (2 years 3 months)".
    Returns a list of [start, end] pairs.
    """

    keys2 = []
    pdf = text
    end = 0
    
    expression = r'\n\n.+\d{1,4}\xa0-\xa0.+\d{1,4}\xa0.+\n'
    
    for match in re.findall(expression, text):
        word = re.search(expression, pdf)
        pdf = pdf[word.end():]
        keys2.append([word.start() + end, word.end() + end])
        end += word.end() 
        
    return keys2

In [17]:
past_companies = get_companies(text, r'\n\n.+\d{1,4}\xa0-\xa0.+\d{1,4}\xa0.+\n')

In [18]:
for word in past_companies:
    companies.append(get_info(word[0]))

In [19]:
past_dates = [text[keyword[0]:keyword[1]].strip() for keyword in past_companies]
present_dates += past_dates

# The lines above only fill the "Date" column of the dataframe below. They can be deleted if that column is not needed.

df = pd.DataFrame({
    "Company": [row[0] for row in companies],
    "Title": [row[1] for row in companies],
    "Date": present_dates
    
})

In [20]:
df

| | Company | Title | Date |
|---|---|---|---|
| 0 | Ozon | Founder | October 2019 - Present |
| 1 | 73lab | Founder | August 2019 - Present |
| 2 | Apres | Advisor & Co-Founder | March 2018 - Present |
| 3 | Sarajevo School of Science and Technology | Lecturer | 2016 - Present |
| 4 | Toptal | Senior Software Engineer | April 2015 - Present |
| 5 | Ant Colony | Co-Founder | September 2016 - July 2019 (2 years 11 months) |
| 6 | Academy387 | Lecturer | June 2014 - December 2015 (1 year 7 months) |
| 7 | Maestral Solutions, Inc. | Solution Architect | September 2014 - August 2015 (1 year) |
| 8 | 2 years 11 months | Software Architect | August 2013 - April 2014 (9 months) |
| 9 | micro services, SharePoint, Team Foundation Se... | Software Engineer | June 2011 - August 2013 (2 years 3 months) |


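# Solving specific cases

Rows 8 and 9 of the table above have the wrong company. When a person held several positions at one company, the block above the title is not the company name: in row 8 it is the total time spent at that company ("2 years 11 months"), and in row 9 it appears to be a piece of description text.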

In [21]:
titles = [row[1] for row in companies]
companies = [row[0] for row in companies]

In [22]:
def find_keywords(text):
    """
    Find all parts of the text that look like this:
        \n\n2 years 11 months\n\n
    
    The regex matches any line that starts with a digit, so the `if` checks filter out the lines that are not plain durations.
    
    Parameters:
    text (string): Text of a pdf
    
    Returns:
    matched_words (list): Matched keywords. Matches such as "less than a year" or "2016-2020 (4 years)" are left out.
    """

    expression = r'\n\n\d.+\n\n'
    matched_words = []
    
    for match in re.findall(expression, text):
        doc = nlp(match)
        for token in doc:
            if(token.lemma_ == "less" or token.lemma_ == "-"):
                break
            if(token.lemma_ == "year" or token.lemma_ == "month"):
                matched_words.append(match)
                break
                
    return matched_words
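
If the duration keywords are always written as "N years M months", "N years" or "N months", a stricter regex might find them without running the spaCy model on every candidate. This is a sketch under that assumption; `find_duration_keywords` and the sample text are hypothetical.

```python
import re

# A full line that is only a duration: "2 years 11 months", "1 year" or "9 months"
DURATION = r"\n\n(?:\d+ years? \d+ months?|\d+ years?|\d+ months?)\n\n"

def find_duration_keywords(text):
    """Return every block that consists of a duration and nothing else."""
    return re.findall(DURATION, text)

# Hypothetical extract: one keyword, one date range and two lines that must not match
sample = (
    "Authority Partners Inc.\n\n2 years 11 months\n\nSoftware Architect\n\n"
    "August 2013\xa0-\xa0April 2014\xa0(9 months)\n\n"
    "less than a year\n\n2016\xa0-\xa02020 (4 years)\n\n"
)

print(find_duration_keywords(sample))
```

The lemma-based version above is more tolerant of wording that this pattern does not anticipate, so the regex is best seen as a faster first filter.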

In [23]:
matched_words = find_keywords(text)

In [24]:
matched_words

['\n\n2 years 11 months\n\n']

In [25]:
def get_dates_below(matched_words):
    """
    In the text, every keyword is followed by the date ranges of single positions.
    The durations of the first N date ranges below a keyword add up to the keyword
    (the plain sum can come out one month longer, as here: 9 months + 2 years 3 months = 3 years). Example:
    
    Keyword: 2 years 11 months 
    Dates below: 
    - August 2013 - April 2014 (9 months)
    - June 2011 - August 2013 (2 years 3 months) 
    - March 2010 - June 2011 (1 year 4 months) 
    - June 2008 - May 2009 (1 year) 
    ...
    
    The first two date ranges cover the keyword, so the person held both positions at the same company.
    
    Parameters:
    matched_words (list): keywords which met the rules
    
    Returns:
    array (list): for each keyword, the date ranges below it
    pom (list): for each keyword, the start and the end of every date range below it
    position_of_keyword (list): the start and the end of each keyword
    
    """

    array = []
    pom = []
    position_of_keyword = []
    
    for matched_word in matched_words:
        # re.escape: the keyword is literal text, not a pattern.
        # Searching the full text each time keeps the positions correct for every keyword.
        for match in re.finditer(re.escape(matched_word), text):
            keyword_end = match.end()
            position_of_keyword.append([match.start(), keyword_end])
            
            # keep only the date ranges that end after the keyword
            array.append([text[key[0]:key[1]] for key in past_companies if keyword_end < key[1]])
            pom.append([[key[0], key[1]] for key in past_companies if keyword_end < key[1]])
    
    return array, pom, position_of_keyword

In [26]:
array, pom, position_of_keyword = get_dates_below(matched_words)

In [27]:
array

[['\n\nAugust 2013\xa0-\xa0April 2014\xa0(9 months)\n',
  '\n\nJune 2011\xa0-\xa0August 2013\xa0(2 years 3 months)\n',
  '\n\nMarch 2010\xa0-\xa0June 2011\xa0(1 year 4 months)\n',
  '\n\nJune 2008\xa0-\xa0May 2009\xa0(1 year)\n',
  '\n\nAugust 2007\xa0-\xa0September 2007\xa0(2 months)\n']]

In [28]:
def get_period(array):
    """
    Keep only the part in brackets of every date range. Example:
    
    Date range:
    '\n\nAugust 2013\xa0-\xa0April 2014\xa0(9 months)\n'
    Output should be:
    (9 months)
    
    Parameters:
    array (list): for each keyword, the date ranges below it
    
    Returns:
    array (list): the same list, with every date range replaced by its elapsed time
    
    """

    expression = r'\(.+\)'

    for i in range(len(array)):
        for j in range(len(array[i])):
            word = array[i][j]
            for match in re.findall(expression, word):
                array[i][j] = match
    
    return array

In [29]:
array = get_period(array)

In [30]:
array

[['(9 months)',
  '(2 years 3 months)',
  '(1 year 4 months)',
  '(1 year)',
  '(2 months)']]

In [31]:
def get_year_and_month(word):
    """
    Get the number of years and months from a duration string.
    
    Parameters:
    word (string): elapsed time between two dates, for example "(2 years 3 months)"
    
    Returns:
    (list): [Number of years, Number of months]; a missing part counts as 0
    """
    doc = nlp(word)
    month = 0
    year = 0
    
    for i in range(len(doc)):
        # the number is the token right before "year(s)" or "month(s)"
        if(doc[i].lemma_ == "year"):
            year = doc[i-1]
        if(doc[i].lemma_ == "month"):
            month = doc[i-1]
    return [int(str(year)), int(str(month))]
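
Since the duration strings have a fixed shape, the same two numbers can likely be read with plain regular expressions. A minimal sketch, assuming the wording is always "N year(s)" and "M month(s)"; `parse_duration` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_duration(word):
    """Return [years, months] found in a string such as '(2 years 3 months)'."""
    years = re.search(r"(\d+)\s+years?", word)
    months = re.search(r"(\d+)\s+months?", word)
    # A part that is not present counts as 0
    return [
        int(years.group(1)) if years else 0,
        int(months.group(1)) if months else 0,
    ]

for sample in ["(2 years 3 months)", "(1 year)", "(9 months)", "\n\n2 years 11 months\n\n"]:
    print(repr(sample), parse_duration(sample))
```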

In [32]:
for p in range(len(matched_words)):
    
    target_year, target_month = get_year_and_month(matched_words[p])
    
    month = 0
    year = 0
    
    string = ""
    
    # Get the name of the company. It is the block right above the keyword.
    for i in reversed(range(position_of_keyword[p][0])):
        string+=text[i]
        if("\n\n" in string[::-1]):
            cmp = string[::-1].strip()
            break
 
    for j in range(len(array[p])):
        
        # Add up the years and months of the positions seen so far
        period_year, period_month = get_year_and_month(array[p][j])
        year += period_year
        month += period_month
            
        # Compare the running total with the keyword. One month is subtracted because the plain sum
        # is one month longer than the keyword (9 months + 2 years 3 months = 3 years, keyword: 2 years 11 months).
        if(year + (month-1)//12 == target_year and (month-1)%12== target_month):
            
            # Replace the wrong company name in the rows of all positions that belong to the keyword
            start = companies.index(matched_words[p].strip())
            companies[start:start+j+1] = [cmp for _ in range(j+1)]
            break
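
The check in the loop is easier to follow when everything is converted to months first. The sketch below mirrors that check, including the one-month correction. Why the sum is one month too long is an assumption: the month in which the position changed (August 2013 in the example) seems to be counted in both date ranges. `to_months` and `positions_covered` are hypothetical helpers.

```python
def to_months(years, months):
    """Convert a (years, months) duration to a number of months."""
    return 12 * years + months

def positions_covered(target, periods):
    """Return how many leading periods add up to target, or None if none do.

    target and every item of periods are (years, months) tuples.
    """
    total = 0
    for count, (years, months) in enumerate(periods, start=1):
        total += to_months(years, months)
        # Same correction as in the loop above: compare the sum minus one month
        if total - 1 == to_months(*target):
            return count
    return None

# Values taken from the example: keyword "2 years 11 months" and the date ranges below it
target = (2, 11)
periods = [(0, 9), (2, 3), (1, 4), (1, 0), (0, 2)]

print(positions_covered(target, periods))
```

With these values, 9 + 27 = 36 months and 36 - 1 = 35 months, which equals 2 years 11 months, so the first two positions belong to the same company.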

In [33]:
df = pd.DataFrame({
    "Company": companies,
    "Title": titles,
    "Date": present_dates
    
})

df

| | Company | Title | Date |
|---|---|---|---|
| 0 | Ozon | Founder | October 2019 - Present |
| 1 | 73lab | Founder | August 2019 - Present |
| 2 | Apres | Advisor & Co-Founder | March 2018 - Present |
| 3 | Sarajevo School of Science and Technology | Lecturer | 2016 - Present |
| 4 | Toptal | Senior Software Engineer | April 2015 - Present |
| 5 | Ant Colony | Co-Founder | September 2016 - July 2019 (2 years 11 months) |
| 6 | Academy387 | Lecturer | June 2014 - December 2015 (1 year 7 months) |
| 7 | Maestral Solutions, Inc. | Solution Architect | September 2014 - August 2015 (1 year) |
| 8 | Authority Partners Inc. | Software Architect | August 2013 - April 2014 (9 months) |
| 9 | Authority Partners Inc. | Software Engineer | June 2011 - August 2013 (2 years 3 months) |
