#**Extracting Entities from Contact Cards**
The provided Excel file stores unstructured text in the 'parsedTxt' column, which makes it hard to extract specific details such as names, phone numbers, emails, and other contact information.

Initially, simple regular expressions (re) were used to extract names, phone numbers, emails, and addresses. Identifying names precisely was difficult because of the variety of patterns and structures in the data. To improve name extraction, a list of the 10,000 most common English words was obtained from this [source](https://www.mit.edu/~ecprice/wordlist.10000). Candidate words that appear in this list are treated as ordinary English rather than as part of a name, which helps isolate Indian names.
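The word-list filter described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: `COMMON_WORDS` is a tiny hypothetical stand-in for the 10,000-word list, and `CANDIDATE` is a simplified version of the name pattern.

```python
import re

# Hypothetical stand-in for the 10,000-word list; the notebook loads the real list from a file.
COMMON_WORDS = {"senior", "associate", "product", "development", "national", "of", "india"}

# Two consecutive capitalised words on the same line (simplified candidate pattern).
CANDIDATE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

def likely_names(text):
    """Return capitalised word pairs in which neither word is a common English word."""
    names = []
    for match in CANDIDATE.finditer(text):
        words = match.group().split()
        if not any(word.lower() in COMMON_WORDS for word in words):
            names.append(match.group())
    return names

print(likely_names("Senior Associate\nViraj Shah\nNational Payments"))
```

Pairs such as "Senior Associate" are rejected because at least one of their words is in the common-word set, while "Viraj Shah" survives the filter.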

Further exploration was carried out using Natural Language Processing (NLP) libraries such as NLTK, spaCy, and Flair for Named Entity Recognition (NER). Despite their use, the accuracy was limited since these models were primarily trained on English datasets, and Indian names or specific addresses might not align perfectly with their training data.

Extraction of phone numbers, emails, and websites was relatively straightforward using regular expressions. However, extracting company names posed challenges due to the absence of a fixed pattern in the dataset. The Flair library was employed to identify 'organization' entities, but the results were not as accurate as expected. For job profiles, a keyword-based approach was used: a list of job-related keywords was compiled, and candidates found by a regular expression were kept only if they matched at least one word from this list.
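The keyword-based check for job titles can be sketched like this. The keyword set and the `max_unknown` parameter are assumptions made for illustration; the notebook reads its own keyword list from a file.

```python
# Hypothetical keyword list, for illustration only.
JOB_KEYWORDS = {"senior", "associate", "manager", "executive", "assistant",
                "head", "chief", "founder"}

def looks_like_job_title(line, max_unknown=1):
    """True if the line has at least one job keyword and at most `max_unknown` other words."""
    words = line.lower().split()
    unknown = [word for word in words if word not in JOB_KEYWORDS]
    has_keyword = len(unknown) < len(words)
    return has_keyword and len(unknown) <= max_unknown

lines = ["Executive Assistant", "Raheja Titanium", "Viraj Shah", "Business Head"]
print([line for line in lines if looks_like_job_title(line)])
```

Allowing one unknown word lets a title such as "Business Head" through even though "business" is not in the keyword set.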

In [None]:
# Import necessary libraries
import re  # Import the regular expression library for text processing
!pip install flair
import flair  # Import Flair for NLP tasks
import spacy  # Import Spacy for advanced NLP
import nltk  # Import NLTK for NLP operations
import pandas as pd  # Import Pandas for working with data
from nltk.stem import WordNetLemmatizer  # Import WordNetLemmatizer for word lemmatization
from flair.data import Sentence  # Import Sentence class from Flair for sentence-level processing
from flair.models import SequenceTagger  # Import SequenceTagger from Flair for sequence labeling
from nltk.tokenize import word_tokenize  # Import word_tokenize for word tokenization
from nltk.tag import pos_tag  # Import pos_tag for part-of-speech tagging
from nltk.chunk import ne_chunk  # Import ne_chunk for named entity chunking
import random # For generating random numbers




In [None]:
# Downloading essential NLTK resources:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
# Download the large English language model for spaCy
!python -m spacy download en_core_web_lg

2023-10-28 13:26:03.104970: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-28 13:26:03.105122: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-28 13:26:03.105263: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Collecting en-core-web-lg==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 943.8 kB/s eta 0:00:00
✔ Download and installation successful
You can n

#**English words and Job Title Recognition**

In [None]:
common_words = set()
# Load the list of top 10 thousand common words in english from 'top10k.txt' file
with open('top10k.txt', 'r') as file:
    for line in file:
        word = line.strip()  # Remove leading/trailing whitespace
        common_words.add(word)

common_jobs = set()
# Load the list of common job-related words in English from 'jobs.txt' file
with open('jobs.txt', 'r') as file:
    # Split the file into words and add each one, lower-cased, to the set
    for word in file.read().split():
        common_jobs.add(word.lower())

# Create the lemmatizer once rather than on every call
lemmatizer = WordNetLemmatizer()

def is_english_word(word):
    # Convert the lower-cased word to its base (noun) form
    wl = lemmatizer.lemmatize(word.lower(), pos="n")
    # Check if the base form is in the list of common words
    return wl in common_words

def is_job(word):
    # Check if the lower-cased word is in the list of common_jobs
    return word.lower() in common_jobs


#**RE-based functions to extract contact entities**

In [None]:
# Function to extract Full name from text
def extract_name(text):
    # Define a regular expression pattern to match names
    name_pattern = r'(M(r|s|rs)\.\s)?[A-Z]([A-Z]+|[a-z]+)\s[A-Z]\w*\s'

    # Find all matches in the text
    matches = re.finditer(name_pattern, text)

    # List to store matched names
    matched_names = []

    # Iterate through the matches and store them in the list
    for match in matches:
        matched_names.append(match.group())

    # Iterate through the matched names
    for entity in matched_names:

        # Check if the entity has at least two words
        if entity and len(entity.split()) >= 2:
            flag = 0
            words = entity.split()

            # Check each word in the entity
            for word in words:
                # Check if the word is an English word or if it's too short
                if is_english_word(word) or len(word) <= 2:
                    flag = 1

            # If no problematic words were found, consider it a valid name
            if flag == 0:
                return entity


# Function to extract job titles from text
def extract_job(text):
    # Reuse the two-capitalised-words pattern from extract_name to find candidates;
    # the keyword check below decides which candidate counts as a job title
    job_pattern = r'(M(r|s|rs)\.\s)?[A-Z]([A-Z]+|[a-z]+)\s[A-Z]\w*\s'

    # Find all matches in the text
    matches = re.finditer(job_pattern, text)

    # List to store matched job titles
    matched_jobs = []

    # Iterate through the matches and store them in the list
    for match in matches:
        matched_jobs.append(match.group())

    # Iterate through the matched job titles
    for entity in matched_jobs:
        #print(entity)

        flag = 0
        words = entity.split()

        # Check each word in the entity
        for word in words:
            # If a word does not match the criteria for a job title, increment the flag
            if not is_job(word):
                flag += 1

        # If there is at most one non-job title word, consider it a valid job title
        if flag <= 1:
            return entity
# Function to extract phone numbers from text
def extract_numbers(text):
    # Define a regular expression pattern to match phone numbers
    number_pattern = r'(\+91[\-\s]?)?\d{5}(\s|\s{2}|-)?\d{5}'

    # List to store matched phone numbers
    matched_numbers = []

    # Find all matches in the text
    matches = re.finditer(number_pattern, text)

    # Iterate through the matches and store them in the list
    for match in matches:
        matched_numbers.append(match.group())
        #print(match)

    return matched_numbers

# Function to extract email addresses from text
def extract_emails(text):
    # Define a regular expression pattern to match email addresses
    # Note: 'org' and 'co.in' style suffixes are tried in the order listed, so 'org' wins
    # over 'org.in' and an address ending in '.org.in' is captured only up to '.org'
    email_pattern = r'[a-zA-Z0-9.-]+(\s)?@[a-zA-Z0-9-\s]+(\.|,)?(com|net|org|in|edu|pe|ai|money|tech|one|ailen|org.in|co.in)'
    # Alternative, more generic pattern that was also tried:
    #r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,7}\b'

    # List to store matched email addresses
    matched_emails = []

    # Find all matches in the text
    matches = re.finditer(email_pattern, text)

    # Iterate through the matches and store them in the list
    for match in matches:
        matched_emails.append(match.group())
        #print(match)

    return matched_emails

# Function to extract website URLs from text
def extract_website(text):
    # Define a regular expression pattern to match website URLs
    url_pattern = r'\s(https?://)?(www\.|WWW\.|wwW\.|Www\.)?(\w+|[a-zA-Z-]+)\.(\s)?(com|net|org|in|pe|edu|ai|money|tech|one|ailen|org.in|co.in)'

    # List to store matched website URLs
    matched_urls = []

    # Find all matches in the text
    matches = re.finditer(url_pattern, text)

    # Iterate through the matches and store them in the list
    for match in matches:
        matched_urls.append(match.group())

    return matched_urls

# Function to extract addresses from text
def extract_address(text):
    # Define regular expressions to match different parts of an address
    address_start = re.compile(r'\s[A-Za-z\.0-9]+?\s\w+\,')
    address_end = re.compile(r'\,(\s|\s{2}|\s{3})[A-Za-z\.0-9]+\s([A-Za-z]+)?(\W+)')
    pincode = re.compile(r'\s\d{6}\s')

    matched_address = []

    # Find the start of the address
    start_matches = address_start.finditer(text)
    for match in start_matches:
        matched_address.append(match.group())
        # Only consider the first match
        break

    # Find the middle part of the address
    middle_address = re.compile(r"(?<=\,)[^\,]+(?=\,)")
    middle_matches = middle_address.finditer(text)
    for match in middle_matches:
        matched_address.append(match.group())

    # Find the end of the address
    end_matches = address_end.finditer(text)
    flag = 0
    for match in end_matches:
        flag = 1
        end_match = match.group()

    # If an end match was found, add it to the address
    if flag == 1:
        matched_address.append(end_match)

    # Find the pincode
    pincode_matches = pincode.finditer(text)
    for match in pincode_matches:
        matched_address.append(match.group())

    # Concatenate the matched address parts
    address = ''
    for word in matched_address:
        address += word

    return address


#**NER (Named Entity Recognition) based functions to extract contact entities**

In [None]:
# Load the English language model once, so it is not reloaded on every call
nlp = spacy.load("en_core_web_lg")

def extractfromspacy(text):
    # Process the text with spaCy
    doc = nlp(text)

    # Iterate through named entities in the text
    for ent in doc.ents:
        # Print the entity text and its label
        print(ent.text, "|", ent.label_)

        # Check if the entity is an organization (label 'ORG') and has a length of 10 characters or more
        if len(ent.text) >= 10 and ent.label_ == 'ORG':
            return ent.text

    # If no ORG entity is found, return None
    return None

def extractfromnltk2(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    tree = ne_chunk(tagged)

    # Extract named entities
    named_entities = []

    for subtree in tree:
        if isinstance(subtree, nltk.Tree):
            entity = " ".join([word for word, tag in subtree.leaves()])
            entity_label = subtree.label()
            named_entities.append((entity, entity_label))

    # Print all named entities
    for entity, label in named_entities:
        print(f"Entity: {entity}, Label: {label}")

    # Return the first multi-word ORGANIZATION entity that contains no common English word
    for entity, label in named_entities:
        if "ORGANIZATION" in label and len(entity.split()) >= 2:
            if not any(is_english_word(word) for word in entity.split()):
                return entity

    # If no organization entity is found, return None
    return None

# Load the NER (Named Entity Recognition) model
tagger = SequenceTagger.load("ner")
# Function to extract organizations from text using Flair
def extractfromflair(text):


    # Create a Flair Sentence object with the input text
    sentence = Sentence(text)

    # Use the NER model to predict named entities in the sentence
    tagger.predict(sentence)

    #named_entities = []

    # Iterate through the named entities recognized by Flair
    for entity in sentence.get_spans("ner"):
        #named_entities.append((entity.text, entity.tag))
        # Print the recognized entity and its tag (optional)
        # print(f"Entity: '{entity.text}' ({entity.tag})")

        # Check if the entity is an organization (tag 'ORG') and has a length of 10 characters or more
        if len(entity.text) >= 10 and entity.tag == 'ORG':
            return entity.text

    # If no ORG entity is found, return None
    return None


2023-10-28 13:27:39,832 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


In [None]:
df=pd.read_excel('MyContacts(1).xlsx') # Read data from the Excel file 'MyContacts(1).xlsx' into a Pandas DataFrame

In [None]:
text=df['parsedTxt'][random.randint(0, 170)] #Sample text
text = text.replace('\n','  ') #modifying text for RE
text = ' ' + text + ' '
print(text)

 Senior Associate UPI Product Development  Viraj Shah  National Payments Corporation of india  3rd Floor 302. Raheja Titanium,  Off. Western Express Highway,  Goregaon (E), Mumbai - 400 063  T: +919082054231  Email: viraj.shah@npci.org.in  Website: wwW.npci.org.in 
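The padding step matters because several of the patterns above expect whitespace around a match; the name pattern, for instance, ends with `\s`. The sketch below uses a short hypothetical card text (`raw`) to show a name at the very end of the text being missed without preprocessing. The helper names `candidates` and `preprocess` are introduced here for illustration only.

```python
import re

# Name pattern copied from extract_name; it ends with \s, so a name needs trailing whitespace.
NAME_PATTERN = r'(M(r|s|rs)\.\s)?[A-Z]([A-Z]+|[a-z]+)\s[A-Z]\w*\s'

def candidates(text):
    """Return every substring matched by the name pattern."""
    return [m.group() for m in re.finditer(NAME_PATTERN, text)]

def preprocess(text):
    """Same preprocessing as the cell above: newlines become two spaces, then pad both ends."""
    return ' ' + text.replace('\n', '  ') + ' '

raw = "Senior Associate\nViraj Shah"  # hypothetical card text that ends with the name

print(candidates(raw))              # the name at the end of the text is missed
print(candidates(preprocess(raw)))  # after padding, the name is matched as well
```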


In [None]:
# Extract name (using RE and English words)
extract_name(text)

'Viraj Shah '

In [None]:
# Extract job title (using RE and common job related words)
extract_job(text)

'Senior Associate '

In [None]:
# Extract address (using purely RE)
extract_address(text)

' Raheja Titanium,  Off. Western Express Highway  Goregaon (E), Mumbai - '

In [None]:
# Extract phone numbers (using purely RE)
extract_numbers(text)

['+919082054231']
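To see which formats the phone pattern accepts, it can be run against a few strings. The pattern is copied from `extract_numbers`; the helper `find_numbers` and the last sample (a 12-digit number without the `+`) are hypothetical additions that illustrate a limitation: the pattern has no boundary checks, so it can pick ten digits out of a longer run.

```python
import re

# Pattern copied from extract_numbers: optional +91 prefix, then 5 + 5 digits
# optionally separated by whitespace or a hyphen.
NUMBER_PATTERN = r'(\+91[\-\s]?)?\d{5}(\s|\s{2}|-)?\d{5}'

def find_numbers(text):
    """Return every substring matched by the phone pattern."""
    return [m.group() for m in re.finditer(NUMBER_PATTERN, text)]

samples = [
    "T: +919082054231",
    "+91-7065483258",
    "G +91 83296 07320",
    "Mumbai - 400 063",   # pin code written with a space: not matched
    "919082054231",       # hypothetical: no '+', so only the first ten digits match
]
for sample in samples:
    print(repr(sample), "->", find_numbers(sample))
```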

In [None]:
# Extract website URL (using purely RE)
extract_website(text)

[' wwW.npci.org']

In [None]:
# Extract email addresses (using purely RE)
extract_emails(text)

['viraj.shah@npci.org']
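The result above stops at `.org` although the address in the text ends in `.org.in`. Python's `re` tries alternatives from left to right and takes the first one that matches, so `org` wins before `org.in` is ever tried. A possible fix, sketched here with a shortened suffix list and a hypothetical helper `email_pattern`, is to list the longer suffixes first and escape their dots. The website pattern lists its suffixes in the same order and truncates in the same way.

```python
import re

# Shortened suffix lists for illustration; the first keeps the notebook's ordering.
SUFFIXES_ORIGINAL = r'(com|net|org|in|org.in|co.in)'
SUFFIXES_REORDERED = r'(org\.in|co\.in|com|net|org|in)'  # longest first, dots escaped

def email_pattern(suffixes):
    """Build the email pattern used above with the given suffix alternation."""
    return r'[a-zA-Z0-9.-]+(\s)?@[a-zA-Z0-9-\s]+(\.|,)?' + suffixes

def first_email(text, suffixes):
    """Return the first email-like match, or None if there is none."""
    match = re.search(email_pattern(suffixes), text)
    return match.group() if match else None

text = " Email: viraj.shah@npci.org.in "
print(first_email(text, SUFFIXES_ORIGINAL))   # truncated at '.org'
print(first_email(text, SUFFIXES_REORDERED))  # full '.org.in' address
```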

In [None]:
# Extract company name (label ORG) using Flair (using ner)
extractfromflair(text)

'Viraj Shah  National Payments Corporation of india'

In [None]:
# Extract company name (label ORG) using spacy (using ner)
extractfromspacy(text)

Associate UPI Product Development | ORG


'Associate UPI Product Development'

In [None]:
# Extract entities using NLTK (using ner)
extractfromnltk2(text)

Entity: Associate, Label: ORGANIZATION
Entity: Shah National Payments Corporation, Label: PERSON
Entity: Raheja Titanium, Label: PERSON
Entity: Off, Label: PERSON
Entity: Western Express Highway, Label: PERSON
Entity: Goregaon, Label: GPE
Entity: Mumbai, Label: GPE
Entity: Website, Label: PERSON


In [None]:
# Going through the whole dataset
for index, row in df.iterrows():
    # Skip rows whose 'parsedTxt' column is empty, so stale text from a previous row is never reused
    if pd.isna(row['parsedTxt']):
        continue
    text = row['parsedTxt']

    # Text preprocessing
    text2 = text.replace('\n', '  ')
    text2 = ' ' + text2 + ' '

    # Extract name (using RE and English words)
    name = extract_name(text2)
    if name:
        df.at[index, 'fullname'] = name
        text2 = text2.replace(name, ' ')  # remove the found entity to improve the accuracy of later steps

    # Extract phone numbers (using purely RE); keep at most the first two
    number = extract_numbers(text2)
    for column, value in zip(['phone', 'phone_2'], number):
        df.at[index, column] = value
        text2 = text2.replace(value, ' ')

    # Extract email addresses (using purely RE); keep at most the first two
    email = extract_emails(text2)
    for column, value in zip(['email', 'email_2'], email):
        df.at[index, column] = value
        text2 = text2.replace(value, ' ')

    # Extract website URL (using purely RE)
    website = extract_website(text2)
    if website:
        df.at[index, 'website'] = website[0]
        text2 = text2.replace(website[0], ' ')

    # Extract address (using purely RE)
    address = extract_address(text2)
    if address:
        df.at[index, 'address'] = address

    # Extract job title (using RE and common job related words)
    job_title = extract_job(text2)
    if job_title:
        df.at[index, 'job_title'] = job_title

    # Extract company name using Flair (using ner)
    company_name = extractfromflair(text)
    if company_name:
        df.at[index, 'company'] = company_name


In [None]:
# Save the DataFrame to an Excel file
df.to_excel("MyContacts2.xlsx",index=False)

In [None]:
df.head(5)

Unnamed: 0,parsedTxt,fullname,company,job_title,address,phone,phone_2,email,email_2,website
0,Making Payment Simpter\nSambhav Pay\n+91-70654...,Mrs. Sapna Raghav,Udyog Vihar,BUSINESS HEAD,"Udyog Vihar, Haryana B21 Phase-5 Udyog Vihar...",+91-7065483258,,ops@sambhavpay.com,,www.sambhavpay.com
1,Making Payment Simpier\nSambhav Pay\n+91-88824...,Mr. Jayant Mallick,,,"Udyog Vihar, Haryana B21 Phase-5 Udyog Vihar...",+91-8882484147,,jayant@sambhavpay.com,,www.sambhavpay.com
2,Vaibhavi Kamath\nExecutive Assistant to CEO\nK...,Vaibhavi Kamath,KNIGHT FINTECH,Executive Assistant,,+919136706988,,vaibhavi@knightfintech.com,,www.knightfintech.com
3,aytring\nDebal Chakraborty\nCo-Founder\nOFfice...,Debal Chakraborty,,,"MM Towers Sector-18 Gurgaon, MM Towers",+91 9711192256,,debal@paytringcom,,www.paytring.com
4,dheerajafinarkein.com\nG +91 83296 07320\nChie...,Dheeraj Kumar,,Chief Technology,,+91 83296 07320,,,,dheerajafinarkein.com


#**Summary**
This notebook tried both regular expressions and NLP-based techniques to extract specific entities from the unstructured text. Phone numbers, emails, and websites could be extracted fairly accurately with regular expressions, but names, company names, and job titles proved harder because of the varied nature of the data. Named Entity Recognition with **Flair** was used specifically for **company names** because it was more accurate than the spaCy and NLTK NER models on this data. Even so, its accuracy on the provided text remained limited.