# Title: Exploring Textual Patterns / Performing Information Extractions

The primary goal of this notebook is to extract valuable insights from the news articles and to see whether we can also extract out-of-the-box relationships. The notebook covers the following steps:  
- Text Preprocessing  
- Rule 1 for IE: Noun-Verb-Noun Extraction  
- Rule 2 for IE: Adjective-Noun Extraction  
- Rule 3 for IE: Preposition-Noun Extraction  
- Rule 4 for IE: Combination of NVN + AD Extraction based rules

Each step is described in detail in the sections below.

## Generic Actions

In [1]:
import os

# Move up one level so that relative paths resolve from the project root
os.chdir(os.path.dirname(os.getcwd()))
os.getcwd()

'c:\\Users\\manash.jyoti.konwar\\Documents\\AI_Random_Projects\\NLP-Information-Pattern-Finder'

### Libraries Import

In [2]:
import multiprocessing
import string
import pandas as pd
import dask.dataframe as dd

from dask.diagnostics import ProgressBar
ProgressBar().register()

from sn_textual_preprocessing import *

### Notebook Variables

In [3]:
# Input file path
input_filepath = os.path.join('input', 'news_articles_dataset.csv')

# Derived file path
output_path = 'output'
os.makedirs(output_path, exist_ok=True)

sample_frac = 0.1

ouptut_overall_data = os.path.join(output_path, 'df_nvn_news.csv')
ouptut_nvn_sep_data = os.path.join(output_path, 'df_nvn_sep_news.csv')

### Reading data

In [4]:
input_data = pd.read_csv(input_filepath)
input_data.columns = [col_name.upper() for col_name in input_data.columns]
input_data.shape

(2225, 3)

In [5]:
# Stratified sample: draw the same fraction of articles from every category
input_data = input_data.groupby('CATEGORIES', group_keys=False).apply(
    lambda group: group.sample(frac=sample_frac, random_state=42)
)
input_data.shape

(223, 3)

In [6]:
input_data.CATEGORIES.value_counts()

business         51
sport            51
politics         42
tech             40
entertainment    39
Name: CATEGORIES, dtype: int64

## Text Preprocessing  

The steps are as follows:  
- Remove mentions and hashtags  
- Remove URLs  
- Remove contractions  
- Remove stopwords and punctuation  
- Lemmatize all words and lowercase each of them  
- Remove redundant domain specific words  
- Remove extra spaces 
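
The helpers behind these steps live in `sn_textual_preprocessing`, which is not shown in this notebook. As a rough illustration of the URL, mention/hashtag, and extra-space steps only, here is a minimal regex-based sketch. The function names and patterns (`strip_urls`, `strip_mentions_hashtags`, `collapse_spaces`, `clean`) are hypothetical stand-ins, not the module's actual implementation.

```python
import re

# Hypothetical stand-ins for the helpers in sn_textual_preprocessing (assumed, not the real code).
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# The lookbehind keeps e-mail addresses such as user@example.com intact.
MENTION_HASHTAG_RE = re.compile(r"(?<!\w)[@#]\w+")
# Only spaces and tabs are collapsed, so paragraph breaks (newlines) survive.
SPACES_RE = re.compile(r"[ \t]+")


def strip_urls(text):
    """Drop http(s):// and www. links."""
    return URL_RE.sub("", text)


def strip_mentions_hashtags(text):
    """Drop @mentions and #hashtags."""
    return MENTION_HASHTAG_RE.sub("", text)


def collapse_spaces(text):
    """Squeeze runs of spaces/tabs into one space and trim the ends."""
    return SPACES_RE.sub(" ", text).strip()


def clean(text):
    """Apply the three steps in the same order as preprocess_text does."""
    text = strip_urls(text)
    text = strip_mentions_hashtags(text)
    return collapse_spaces(text)


sample = "Shares soar  @trader says #stocks see https://example.com/report now"
print(clean(sample))  # Shares soar says see now
```

URLs are removed first so that a `#fragment` inside a link is not treated as a hashtag.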

In [7]:
def preprocess_text(text):
    result = remove_urls(text)
    result = remove_mentions_hashtags(result)
    result = remove_contractions(result)
    result = remove_stopwords_punc_nos(result, 
                                       remove_stopwords_flag=False, 
                                       punc_2_remove=string.punctuation.replace('-','').replace('%','').replace('.',''), 
                                       remove_digits_flag=False,
                                       remove_pattern_punc_flag=True)
    result = remove_extra_spaces(result)
    return result
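
To make the `punc_2_remove` argument concrete: it is `string.punctuation` with `-`, `%`, and `.` taken out, so hyphenated words, percentages, and decimals survive the cleanup. How `remove_stopwords_punc_nos` applies this set internally is not shown here; the sketch below uses `str.translate` as an assumed equivalent, and `drop_punctuation` is a hypothetical name.

```python
import string

# Same expression as in preprocess_text: every ASCII punctuation mark except '-', '%' and '.'
punc_to_remove = string.punctuation.replace('-', '').replace('%', '').replace('.', '')

# Translation table that deletes each character in punc_to_remove
table = str.maketrans('', '', punc_to_remove)


def drop_punctuation(text):
    """Delete the listed punctuation marks; '-', '%' and '.' are kept."""
    return text.translate(table)


print(drop_punctuation("Saudi NCCI's shares rose 3.5% year-on-year, analysts said!"))
# Saudi NCCIs shares rose 3.5% year-on-year analysts said
```

This is consistent with the sample output further down, where "Saudi NCCI's" appears as "Saudi NCCIs" after preprocessing.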

In [8]:
# Split the articles into partitions and preprocess them in parallel processes
n_partitions = 4 * multiprocessing.cpu_count()
input_data['PREPROCESSED_TEXT'] = (
    dd.from_pandas(input_data.ARTICLES, npartitions=n_partitions)
    .map_partitions(lambda part: part.apply(preprocess_text))
    .compute(scheduler='processes')
)

[########################################] | 100% Completed | 6.78 s


In [9]:
input_data

| | ARTICLES | SUMMARIES | CATEGORIES | PREPROCESSED_TEXT |
|---|---|---|---|---|
| 480 | Christmas sales worst since 1981\n\nUK retail ... | "The retail sales figures are very weak, but a... | business | Christmas sales worst since 1981\n\nUK retail ... |
| 449 | US retail sales surge in December\n\nUS retail... | US retail sales ended the year on a high note ... | business | US retail sales surge in December\n\nUS retail... |
| 475 | Saudi NCCI's shares soar\n\nShares in Saudi Ar... | Shares in Saudi Arabia's National Company for ... | business | Saudi NCCIs shares soar\n\nShares in Saudi Ara... |
| 434 | Fosters buys stake in winemaker\n\nAustralian ... | Australian brewer Fosters has bought a large s... | business | Fosters buys stake in winemaker\n\nAustralian ... |
| 368 | Beer giant swallows Russian firm\n\nBrewing gi... | Inbev was formed in August 2004 when Belgium's... | business | Beer giant swallows Russian firm\n\nBrewing gi... |
| ... | ... | ... | ... | ... |
| 1940 | Joke e-mail virus tricks users\n\nA virus that... | Security firm Network Box said that it stopped... | tech | Joke e-mail virus tricks users\n\nA virus that... |
| 1937 | Games maker fights for survival\n\nOne of Brit... | The administrators told BBC News Online that s... | tech | Games maker fights for survival\n\nOne of Brit... |
| 2223 | US cyber security chief resigns\n\nThe man mak... | Amit Yoran was director of the National Cyber ... | tech | US cyber security chief resigns\n\nThe man mak... |
| 1982 | Freeze on anti-spam campaign\n\nA campaign by ... | This is likely to be in response to spammers w... | tech | Freeze on anti-spam campaign\n\nA campaign by ... |


## Rule 1 for IE: NVN Extraction
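
As a minimal sketch of what a noun-verb-noun rule could look like, the snippet below works on tokens that are already part-of-speech tagged. It assumes Universal POS tags (`NOUN`, `PROPN`, `VERB`), the kind a tagger such as spaCy exposes via `token.pos_`. The function name `extract_nvn`, the nearest-noun heuristic, and the hand-tagged example are all illustrative assumptions, not this notebook's actual rule.

```python
NOUN_TAGS = {"NOUN", "PROPN"}


def extract_nvn(tagged):
    """Return (noun, verb, noun) triples from a list of (token, tag) pairs.

    For every verb, pair the nearest noun to its left with the nearest noun
    to its right. Sentence boundaries are not checked, so the input should
    be a single sentence.
    """
    triples = []
    for i, (token, tag) in enumerate(tagged):
        if tag != "VERB":
            continue
        # Nearest noun before the verb (scan leftwards)
        left = next((t for t, g in reversed(tagged[:i]) if g in NOUN_TAGS), None)
        # Nearest noun after the verb (scan rightwards)
        right = next((t for t, g in tagged[i + 1:] if g in NOUN_TAGS), None)
        if left and right:
            triples.append((left, token, right))
    return triples


# Hand-tagged headline from the sample data (tags are illustrative, not tagger output)
sentence = [("Fosters", "PROPN"), ("buys", "VERB"), ("stake", "NOUN"),
            ("in", "ADP"), ("winemaker", "NOUN")]
print(extract_nvn(sentence))  # [('Fosters', 'buys', 'stake')]
```

A dependency-based rule (subject and object of each verb) would likely be more precise than this window heuristic, at the cost of requiring a parser.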

## Rule 2 for IE: AN Extraction
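
One possible shape for an adjective-noun rule, again on pre-tagged tokens with Universal POS tags assumed: collect each run of adjectives that is immediately followed by a noun. `extract_an` and the hand-tagged example are hypothetical and only illustrate the pattern.

```python
def extract_an(tagged):
    """Return (adjectives, noun) pairs from a list of (token, tag) pairs.

    A pair is emitted when a run of one or more ADJ tokens is directly
    followed by a NOUN or PROPN token.
    """
    pairs = []
    adjectives = []
    for token, tag in tagged:
        if tag == "ADJ":
            adjectives.append(token)
            continue
        if tag in {"NOUN", "PROPN"} and adjectives:
            pairs.append((" ".join(adjectives), token))
        # Any non-adjective token ends the current run
        adjectives = []
    return pairs


# Hand-tagged headline from the sample data (tags are illustrative, not tagger output)
sentence = [("Beer", "NOUN"), ("giant", "NOUN"), ("swallows", "VERB"),
            ("Russian", "ADJ"), ("firm", "NOUN")]
print(extract_an(sentence))  # [('Russian', 'firm')]
```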

## Rule 3 for IE: PN Extraction
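
A preposition-noun rule could be sketched as follows: for each preposition (Universal POS tag `ADP`, assumed), take the next noun, allowing determiners, adjectives, and numerals in between. The name `extract_pn`, the `skip_tags` parameter, and the hand-tagged examples are assumptions for illustration.

```python
def extract_pn(tagged, skip_tags=("DET", "ADJ", "NUM")):
    """Return (preposition, noun) pairs from a list of (token, tag) pairs.

    Modifiers listed in skip_tags may sit between the preposition and its
    noun; any other tag ends the search for that preposition.
    """
    pairs = []
    for i, (token, tag) in enumerate(tagged):
        if tag != "ADP":
            continue
        for next_token, next_tag in tagged[i + 1:]:
            if next_tag in {"NOUN", "PROPN"}:
                pairs.append((token, next_token))
                break
            if next_tag not in skip_tags:
                break
    return pairs


# Hand-tagged headlines from the sample data (tags are illustrative, not tagger output)
headline_1 = [("US", "PROPN"), ("retail", "NOUN"), ("sales", "NOUN"),
              ("surge", "VERB"), ("in", "ADP"), ("December", "PROPN")]
headline_2 = [("Freeze", "NOUN"), ("on", "ADP"), ("anti-spam", "ADJ"), ("campaign", "NOUN")]
print(extract_pn(headline_1))  # [('in', 'December')]
print(extract_pn(headline_2))  # [('on', 'campaign')]
```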

## Rule 4 for IE: Combination of NVN + AD Extraction based rules