<a href="https://colab.research.google.com/github/TamedTiger18/hello-world/blob/master/Text_visulisation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Making Text Beautiful
Josh Taylor 2019


## Imports:
The matcher patterns used in this analysis require spaCy 2.1 or later. Uncomment the pip install line if an upgrade is needed:

In [0]:
import os
import numpy as np
import pandas as pd
#!pip install spacy --upgrade  # uncomment if a newer version of spaCy is needed
# Note: this downloads en_core_web_md, but the pipeline loaded below is en_core_web_sm
!python -m spacy download en_core_web_md
import spacy
#!pip install scattertext
import scattertext as st
import en_core_web_sm
from IPython.display import HTML

nlp = en_core_web_sm.load()

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


## Get the data 

The code below downloads a pickled pandas DataFrame that is publicly hosted on Google Drive, so it should work for anyone.

In [0]:
import requests

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params = { 'id' : id }, stream = True)
    token = get_confirm_token(response)

    if token:
        params = { 'id' : id, 'confirm' : token }
        response = session.get(URL, params = params, stream = True)

    save_response_content(response, destination)    

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value

    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)


file_id = '1M_XljfV5t_nGjvhyfTPO9n2nfOweMwYx'
destination = 'temp'
download_file_from_google_drive(file_id, destination)

df = pd.read_pickle('temp')

#small amount of cleaning:
df.drop(['FT_tfidf'], axis=1, inplace=True)
df = df.loc[df.error == 0].copy()  # copy so that later column assignments act on an independent frame

print('Shape of Data Frame: %s' %(df.shape,))

Shape of Data Frame: (4873, 11)


In [0]:
# Data cleaning: tidy up year values and remove blanks
df['Period Covered'].value_counts()

def yearClean(yr):
    # For a hyphenated period, keep the part after the first hyphen
    if '-' in str(yr):
        yr = str(yr).split('-')[1]
    return yr

df['Year'] = df['Period Covered'].apply(yearClean)
# Delete blank years:
df['Year'] = df['Year'].replace('', np.nan)
df.dropna(subset=['Year'], inplace=True)
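
The year-cleaning step can be tried in isolation on a few values. This is a minimal sketch using only the standard library; the sample strings are hypothetical, since the actual contents of `Period Covered` are not shown here, and `year_clean` is an illustrative stand-in for `yearClean` that always works on the string form of the value.

```python
def year_clean(period):
    """Return the part after the first hyphen, or the value unchanged."""
    text = str(period).strip()
    if '-' in text:
        # A hyphenated period keeps only its second component
        text = text.split('-')[1].strip()
    return text

# Hypothetical sample values (not taken from the real data)
samples = ['2017', '2016-2017', '']
cleaned = [year_clean(s) for s in samples]

# Blank years are dropped, mirroring the replace/dropna step
kept = [y for y in cleaned if y != '']
print(cleaned, kept)
```

Note that a period with more than one hyphen would keep only its second component, which may or may not be the intended year.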

## Phrase extraction:

#### Filtering for testing:

In [0]:
# Test for stripping out entities
# Note: run the phrase-extraction cell below first, since it defines Output
test2 = Output[0]
test = nlp(test2)
for ent in test.ents:
  if ent.label_ == 'ORG':
    print(ent.text)
    test2 = test2.replace(ent.text, "ORG")


the Honda Group
the Honda Group
The Honda Group
Supplier CSR Guidelines
the Honda Group’s
Honda
MSA
Honda
The Honda Group
The Honda Group
The Honda Group
the U.S. Securities and Exchange Commission
the Honda Group’s
HUM
the Supplier Ethics Line
Honda
Honda
the Supplier Ethics Line
HUM
the Product Compliance & Sustainability Team
The Supplier Ethics Line
Honda
HUM
HUM
Honda
the Self-Assessment Questionnaire
CSR Guidelines
Sustainability
Honda
HME-L
Key Performance Indicator
HUM
Honda Group
Honda
Anti-Slavery
Honda
MSA
Honda


In [0]:
%%time
from spacy.matcher import Matcher
from spacy import displacy
import re

# List of the text items to process. Only the first 10 Honda statements are selected here;
# typically this would be the whole text column of the df.
text_list = df.loc[df.Company.str.contains('Honda')==True].text.values[0:10]

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # Matched span
    sent = span.sent  # Sentence containing matched span
    # Append a mock entity for the match, in displaCy style, to matched_sents.
    # The match span is made relative to the sentence by offsetting the span's
    # start and end by the start of the sentence in the doc.
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})
    
    
matcher = Matcher(nlp.vocab)
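# The IN / NOT_IN set operators used below require spaCy >= 2.1.
# The add() calls use the spaCy 2.x signature; spaCy 3 expects
# matcher.add("commit", [pattern], on_match=collect_sents) instead.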

pattern = [{'POS': {'IN': ['PROPN', 'PRON']}, 'LOWER': {'NOT_IN': ['they','who','you','it','us']}  },
           {'POS': 'VERB', 'LOWER': {'NOT_IN': ['may','might','could']}  },
           {'POS': {'IN': ['VERB', 'DET']}, 'LOWER': {'NOT_IN': ['a']}}]
matcher.add("commit", collect_sents, pattern)

pattern = [{'POS': {'IN': ['PROPN','PRON']}, 'LOWER': {'NOT_IN': ['they','who','you','it','us']}  },
           {'POS': 'VERB', 'LOWER': {'NOT_IN': ['may','might','could']}},
           {'POS': 'ADJ'},
           {'POS': 'ADP'}]
matcher.add("commit", collect_sents, pattern)



Output = []
for txt in text_list:
  matched_sents = []  # Collect data of matched sentences to be visualized
  text = txt
  text = re.sub(r'\s{3,}', '. ', text) # runs of 3+ whitespace characters replaced with a full stop
  text = re.sub(r'\s+', ' ', text) #whitespace replaced with a space (includes new lines etc...)
  doc = nlp(text)
  matches = matcher(doc)

  #Remove duplicate sentences which have multiple matches:
  matched_sents_dedupe = []
  for i in matched_sents:
    if i['text'] not in [j['text'] for j in matched_sents_dedupe]: #prevents duplicate sents when there is >1 match per sent
      if not re.search(r'privacy|information|data|you|cookie', i['text'], re.IGNORECASE): # filter out data protection commitments, which often appear in the same document as the modern slavery (MS) statement
        matched_sents_dedupe.append(i)
  processed_text = ''
  for i in matched_sents_dedupe:
    processed_text += " "+i['text']
  Output.append(processed_text)


CPU times: user 1.12 s, sys: 67.1 ms, total: 1.19 s
Wall time: 1.19 s
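
The text preparation around the matcher (whitespace normalisation, then de-duplication and keyword filtering of matched sentences) does not depend on spaCy and can be checked on its own. The sketch below assumes the intent of the first substitution is "three or more whitespace characters"; the function names and sample sentences are hypothetical.

```python
import re

# Same keyword list as the notebook's data-protection filter
EXCLUDE = re.compile(r'privacy|information|data|you|cookie', re.IGNORECASE)

def normalise_whitespace(text):
    # Runs of three or more whitespace characters become a sentence break
    text = re.sub(r'\s{3,}', '. ', text)
    # Any remaining whitespace run collapses to a single space
    return re.sub(r'\s+', ' ', text)

def dedupe_and_filter(sentences):
    """Keep the first occurrence of each sentence that has no excluded keyword."""
    seen = set()
    kept = []
    for sent in sentences:
        if sent in seen or EXCLUDE.search(sent):
            continue
        seen.add(sent)
        kept.append(sent)
    return kept

clean = normalise_whitespace("Our policy\n\n\nWe audit suppliers")
result = dedupe_and_filter([
    "We audit suppliers.",
    "We audit suppliers.",      # duplicate, dropped
    "We protect your data.",    # contains excluded keywords, dropped
])
print(clean)
print(result)
```

Because the filter matches substrings, `you` also excludes sentences containing `your` or `young`; adding word boundaries (`\b`) would narrow it if that proves too aggressive.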


In [0]:
# Output must hold one entry per row of df, so text_list above needs to cover every row
df['clean text'] = Output

## Interactive data exploration

In [0]:
!pip install pivottablejs
from pivottablejs import pivot_ui
#if using locally you can just use the following to display the output: pivot_ui(df)
# As we are using colab, we will just download the output - this can then be opened in a new tab in the browser

df_vis = df[['Company', 'URL', 'Industry', 'HQ', 'Also Covers Companies',
       'UK Modern Slavery Act', 'California Transparency in Supply Chains Act',
       'Period Covered','Year', 'pdf', 'error']]

pivot_ui(df_vis,outfile_path='pivottablejs.html')
HTML('pivottablejs.html')

# if you want to download to open in a new tab in the browser - use the below:
# from google.colab import files
# files.download('pivottablejs.html') 



## Creating the interactive visualisation:

In [0]:
#select industries to compare:
ind1 = 'Specialty Retail'
ind2 = 'Construction & Engineering'

#Filter into a new df with 3 columns one for industry, one for company and the third containing the text
ftr      = (df['Industry'] == ind1) | (df['Industry'] == ind2)
df_corp  = df.loc[ftr]
df_corp  = df_corp[['Industry','Company','clean text']]

#Create a scattertext corpus from the df:
corpus = st.CorpusFromPandas( df_corp, 
                              category_col='Industry', 
                              text_col='clean text',
                              nlp=nlp).build()


In [0]:
html = st.produce_scattertext_explorer(corpus,
         category=ind2,
         category_name=ind2,
         not_category_name=ind1,
         width_in_pixels=1600)
# Write the visualisation to disk, closing the file afterwards
with open("MS-Visualization.html", 'wb') as f:
    f.write(html.encode('utf-8'))
HTML(html)

For downloading:

In [0]:
from google.colab import files
files.download('MS-Visualization.html') 

## WORKINGS:

In [0]:
text = 'The Company currently operates in the following countries: the United Kingdom'
matched_sents =[]
doc = nlp(text)
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.lemma_)
    
# Run the matcher; the collect_sents callback fills matched_sents as a side effect
matches = matcher(doc)
print(matches)

text[matched_sents[0]['ents'][0]['start']:matched_sents[0]['ents'][0]['end']]
# matched_sents

The DET DT det the
Company PROPN NNP nsubj Company
currently ADV RB advmod currently
operates VERB VBZ ROOT operate
in ADP IN prep in
the DET DT det the
following VERB VBG amod follow
countries NOUN NNS pobj country
: PUNCT : punct :
the DET DT det the
United PROPN NNP compound United
Kingdom PROPN NNP appos Kingdom
[(14584963062971571048, 1, 4)]


'Company currently operates'
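
The slice above works because the `start` and `end` stored by `collect_sents` are character offsets relative to the sentence, not to the whole document. A minimal sketch of that arithmetic with plain strings follows; the two-sentence text is hypothetical and the positions are found with `str.index` rather than taken from spaCy.

```python
# Hypothetical document with two sentences
doc_text = "We have a policy. The Company currently operates in the UK."
phrase = "Company currently operates"

# Document-level character offsets of the sentence and of the matched span
sent_start = doc_text.index("The Company")
sent_text = doc_text[sent_start:]
span_start = doc_text.index(phrase)
span_end = span_start + len(phrase)

# Offsets relative to the sentence, as stored in matched_sents
ent = {
    "start": span_start - sent_start,
    "end": span_end - sent_start,
    "label": "MATCH",
}

# Slicing the sentence text with the relative offsets recovers the match
matched = sent_text[ent["start"]:ent["end"]]
print(ent, matched)
```

In the workings cell the text is a single sentence, so slicing `text` and slicing the sentence give the same result; with longer documents the offsets only apply to the sentence text.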

In [0]:
print(df['clean text'].values[608])

 Bodycote plc and its subsidiaries undertake all reasonable and practicable steps to ensure that our ethical standards are being implemented throughout the businesses of our suppliers and that local legislation and regulations are complied with. Suppliers in those countries identified in Walk Free Foundation’s 2016 Global Slavery Index as being the most vulnerable to human rights issues in the supply chain have been identified for further review and audit. We have a Code of Conduct, which sets out our policy on legislation, child labour, anti-slavery and human trafficking, conditions of employment, health and safety and the environment. We have a robust Open Door Line for employees and temporary workers to report any ethical concerns they may have on a confidential basis in their own language. We will only knowingly trade with those who comply with this policy or those who are taking verifiable steps towards compliance. This statement has been approved by our Board of Directors, who wi