<a href="https://colab.research.google.com/github/AbhiSaphire/Colaboratory/blob/master/HGP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HGP Internship Assignment
## Problem Statement
**Given an input string from a user, I need to parse it into components to be used for further processing.
These components will be the best matches against predefined lists and/or scalars.**



---



## My Approach to the Problem
Divide and conquer: handle each task separately.
**For a detailed walkthrough of the solution, please refer to the PDF provided.**

**Entire Process -**
1. Use a regex to extract the time period and its unit, then split them so they can be stored separately.
2. Preprocess the data.
3. Separate sectors by topic modelling with LDA, then apply topic modelling again to get sub-topics (fundamentals) from a corpus of fundamental_docs.
4. Apply contextual-similarity and syntactic-similarity algorithms to the separated sector and fundamental names (for spell checking and synonym matching).
5. Store Sector, Fundamental, Attributes of Fundamentals, Time Period and Unit of Time Period as keys in a dictionary, append their values, and return the dictionary.
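
The last step of the list, collecting values under fixed keys, might look roughly like the sketch below. The helper `build_result`, the key names, and the hand-written values are all hypothetical placeholders for what steps 1 to 4 would produce; they are not taken from the actual solution.

```python
from collections import defaultdict

def build_result(pairs):
    """Group (key, value) pairs into a dict of lists, keeping insertion order."""
    result = defaultdict(list)
    for key, value in pairs:
        result[key].append(value)
    return dict(result)

# Hypothetical, hand-written values standing in for the output of steps 1-4.
parsed = build_result([
    ("sector", "Steel"),
    ("sector", "Metal"),
    ("fundamental", "Revenue"),
    ("fundamental", "EBITDA"),
    ("attribute", "margin"),
    ("time_period", 10),
    ("time_unit", "qtrs"),
])
print(parsed)
```

Using a list per key keeps the "append their values" behaviour described above, so an input that names several sectors or fundamentals needs no special handling.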

## Step 1 - Using Regex

In [2]:
import re
string = "Output Revenue, EBITDA margin for Steel and Metal stocks for past 10 qtrs"
match = re.search(r"\d+\s*\w+", string)  # raw string avoids invalid-escape warnings
print(match.group())

10 qtrs
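
The match above still holds the number and the unit in one string. A minimal sketch of the split described in the approach, assuming named capture groups and a small alias table: `split_period` and `UNIT_ALIASES` are hypothetical names, and the pattern restricts the unit to letters, which is slightly stricter than the `\w+` used above.

```python
import re

# Hypothetical alias table; extend it to whatever units the real inputs use.
UNIT_ALIASES = {
    "qtr": "quarter", "qtrs": "quarter", "quarter": "quarter", "quarters": "quarter",
    "yr": "year", "yrs": "year", "year": "year", "years": "year",
}

def split_period(text):
    """Return (count, unit) for the first '<number> <word>' span, or None if absent."""
    match = re.search(r"(?P<count>\d+)\s*(?P<unit>[A-Za-z]+)", text)
    if match is None:
        return None
    unit = match.group("unit").lower()
    # Fall back to the raw unit when it is not in the alias table.
    return int(match.group("count")), UNIT_ALIASES.get(unit, unit)

query = "Output Revenue, EBITDA margin for Steel and Metal stocks for past 10 qtrs"
print(split_period(query))  # (10, 'quarter')
```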


## Step 2 - Preprocess Data

In [19]:
import gensim
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
nltk.download('wordnet')

def lemmatize_stemming(text):
  # Lemmatize the token as a verb, then reduce it to its Porter stem.
  stemmer = PorterStemmer()
  return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
  # Tokenise and lowercase, drop stopwords and tokens of 3 characters or fewer, then stem.
  result = []
  for token in gensim.utils.simple_preprocess(text):
    if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
      result.append(lemmatize_stemming(token))
  return result

processed_string = preprocess(string)
print(processed_string)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
['output', 'revenu', 'ebitda', 'margin', 'steel', 'metal', 'stock', 'past', 'qtr']


## Step 3 - BoW Conversion and Applying LDA

In [0]:
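# sectors_docs must be supplied to build the Bag of Words.
# Assumption: the first column of the TSV holds one raw text document per row.

import pandas as pd
sectors_docs = pd.read_csv("Some_Sector_Document.tsv", delimiter="\t", quoting=3)
# Dictionary and doc2bow expect lists of tokens, so tokenise each document first.
processed_docs = [preprocess(str(doc)) for doc in sectors_docs.iloc[:, 0]]
dictionary = gensim.corpora.Dictionary(processed_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=2, id2word=dictionary, passes=10, workers=2)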

## Step 4 - Testing LDA model on unseen data

In [0]:
bow_vector = dictionary.doc2bow(processed_string)

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

## Step 5 - Handling Contextual Similarity using WordNet synsets (Pseudo Code)

This is pseudocode for checking the contextual similarity between two words.
It returns True if their best Wu-Palmer similarity score exceeds 0.70.

During development, this check can be run for the tokens in processed_string against the terms in bow_corpus.

In [23]:
from itertools import product

from nltk.corpus import wordnet as wn

def contextual_similarity(wordx="revenue", wordy="sales", threshold=0.70):
  # Compare every pair of WordNet senses and keep the best Wu-Palmer score.
  sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
  maxscore = 0
  for i, j in product(sem1, sem2):
    score = i.wup_similarity(j) or 0  # wup_similarity can return None when no path exists
    maxscore = max(maxscore, score)
  return maxscore > threshold

print(contextual_similarity())

True


## Step 6 - Handling Syntactical Similarity using SymSpell

In [0]:
# This is pseudocode: a newsgroup corpus is built here just to show how SymSpell works.
# For a real model, use a corpus suited to the domain.
import re
from collections import Counter

from sklearn.datasets import fetch_20newsgroups
# Assumption: SymSpell is the wrapper class from the aion package; the original cell omits this import.
from aion.util.spell_check import SymSpell

corpus = []

for line in fetch_20newsgroups().data:
  line = line.replace('\n', ' ').replace('\t', ' ').lower()
  line = re.sub('[^a-z ]', ' ', line)  # keep only lowercase letters and spaces
  tokens = [token for token in line.split(' ') if len(token) > 0]
  corpus.extend(tokens)
corpus = Counter(corpus)  # word -> frequency
corpus_dir = '../'
corpus_file_name = 'dorian_gray.txt'
symspell = SymSpell(verbose=10)
symspell.build_vocab(dictionary=corpus, file_dir=corpus_dir, file_name=corpus_file_name)
symspell.load_vocab(corpus_file_path=corpus_dir+corpus_file_name)
results = symspell.correction(word='helol')
print(results)
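
To show the idea behind SymSpell without the external dependency, here is a simplified, standard-library sketch of a symmetric-delete lookup. It is not the library's implementation: it only handles one deletion on each side, skips the edit-distance verification that real implementations perform, and uses a hypothetical toy vocabulary with made-up frequencies. The names `deletes`, `build_index` and `suggest` are illustrative.

```python
from collections import Counter

def deletes(word):
    """Return the word itself plus every string made by deleting one character."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(vocab):
    """Map each delete variant to the vocabulary words that produce it."""
    index = {}
    for term in vocab:
        for variant in deletes(term):
            index.setdefault(variant, set()).add(term)
    return index

def suggest(word, index, vocab):
    """Collect vocabulary words sharing a delete variant with `word`, most frequent first."""
    candidates = set()
    for variant in deletes(word):
        candidates |= index.get(variant, set())
    return sorted(candidates, key=lambda term: -vocab[term])

# Hypothetical toy vocabulary; a real one would come from a domain corpus.
vocab = Counter({"hello": 50, "help": 30, "hell": 5, "revenue": 40})
index = build_index(vocab)
print(suggest("helol", index, vocab))  # ['hello', 'hell']
```

Because deletions are precomputed for the vocabulary, a lookup only needs to generate deletions of the query word and intersect them with the index, which is what makes this family of spell checkers fast.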