# Data Science Society Datathon 2018: starter on the [Ontotext case](https://www.datasciencesociety.net/the-ontotext-case-data-enriched/) with Python

**Advice from a mentor**: *Thomas Roca, PhD Data strategist @Microsoft*

Need more help? Find me on https://ask.datasciencesociety.net/directory: @thoms

# 1. Where to start when building a training set
- First step: get the industry classification from [ICB](https://en.wikipedia.org/wiki/Industry_Classification_Benchmark)
- To start the training set, one option is the *Forbes* Global 2000, a ranking of the 2,000 largest public companies in the world. A CSV of it can be found on [Google dataset search](https://toolbox.google.com/datasetsearch/search?query=Forbes%20Global%202000%202018&docid=P8adQYMuL1Bt2h1VAAAAAA%3D%3D)
- Matching it with the ICB classification may be necessary.
- Extra information is available as open data, for example:
    - here: http://factforge.net/ or there: http://dbpedia.org/page/Microsoft
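
One possible way to approach the matching step is fuzzy string matching between whatever industry label the company list provides and the ICB sector names. The sketch below uses only the standard library; the helper name `closest_icb_sector`, the `cutoff` value and the example label `"Banking"` are assumptions made for illustration, and the four ICB sectors are a small excerpt of the scraped table further down.

```python
from difflib import get_close_matches

# Small excerpt of ICB sectors (code -> name); the full table is scraped below.
icb_sectors = {
    530: "Oil & Gas Producers",
    4570: "Pharmaceuticals & Biotechnology",
    6570: "Mobile Telecommunications",
    8350: "Banks",
}


def closest_icb_sector(label, sectors, cutoff=0.5):
    """Return (code, name) of the ICB sector whose name is closest to `label`.

    Hypothetical helper: returns None when no name is similar enough.
    """
    # Compare in lower case so that capitalisation does not matter.
    by_name = {name.lower(): (code, name) for code, name in sectors.items()}
    hits = get_close_matches(label.lower(), list(by_name), n=1, cutoff=cutoff)
    return by_name[hits[0]] if hits else None


# "Banking" is a made-up label standing in for a value from the company list.
print(closest_icb_sector("Banking", icb_sectors))
```

Fuzzy matching will produce wrong pairs on short or ambiguous labels, so a manual review of the resulting mapping is probably still needed.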

# 2. Build a text classification model 
- **Machine learning resources:**
    - e.g. with **scikit-learn**:
        - http://scikit-learn.org/stable/modules/svm.html#classification
        - Blog post with an implementation: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f
        - Resources on LinkedIn Learning: https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/
    - Deep learning with **TensorFlow**: 
        - https://www.tensorflow.org/tutorials/text_classification_with_tf_hub
        - https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
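
As a rough starting point for the scikit-learn route, a TF-IDF vectorizer chained with a linear SVM is a common baseline for multi-class text classification. The sketch below is only an illustration: the six company descriptions and their labels are invented toy data, not part of the Ontotext case, and a real training set would need far more examples plus a held-out test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (invented for illustration): description -> ICB industry.
texts = [
    "oil and gas exploration and production company",
    "integrated oil company operating refineries and pipelines",
    "retail bank offering loans mortgages and deposits",
    "commercial bank providing credit cards and loans",
    "software company developing cloud computing services",
    "semiconductor and computer hardware manufacturer",
]
labels = [
    "Oil & Gas", "Oil & Gas",
    "Financials", "Financials",
    "Technology", "Technology",
]

# TF-IDF turns each text into a weighted bag of words;
# LinearSVC then fits one linear classifier per class (one-vs-rest).
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LinearSVC(),
)
model.fit(texts, labels)

predicted = model.predict(["bank loans and mortgages"])[0]
print(predicted)
```

Swapping `LinearSVC` for another scikit-learn classifier only requires changing the last step of the pipeline, which makes it easy to compare models.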
   
**Some advice on feature extraction and text cleaning**:
A machine treats every distinct character as different, so "Bank", "bank" and "bank," are three separate tokens to it. Reduce that noise and stick to the meaning: lowercase everything, and remove punctuation, stop words, and numbers when they are not relevant. Tokenize, then try [stemming or lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) if needed.
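
The cleaning steps can be sketched with the standard library alone. The function name `clean_text` and the very short `STOP_WORDS` set are assumptions for illustration; in practice a fuller stop-word list (for example from NLTK or scikit-learn) and a stemmer or lemmatizer would likely replace them.

```python
import re
import string

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to"}

# Translation table mapping every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, drop_numbers=True):
    """Lowercase, strip punctuation (and optionally digits), drop stop words."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    if drop_numbers:
        text = re.sub(r"\d+", " ", text)
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_text("Microsoft is a Technology company, founded in 1975!"))
# -> ['microsoft', 'technology', 'company', 'founded']
```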

**Preprocessing tips:**
- https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/preprocess-text
- https://textminingonline.com/dive-into-nltk-part-iv-stemming-and-lemmatization


 
### Other resources
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- https://www.datacamp.com/community/tutorials/scikit-learn-fake-news
- https://www.learndatasci.com/tutorials/predicting-reddit-news-sentiment-naive-bayes-text-classifiers/
- https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
- https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
- https://github.com/kjam/random_hackery/blob/master/Comparing%20Fake%20News%20Classifiers.ipynb
- https://blog.kjamistan.com/comparing-scikit-learn-text-classifiers-on-a-fake-news-dataset/ 
- https://towardsdatascience.com/machine-learning-nlp-text-classification-using-scikit-learn-python-and-nltk-c52b92a7c73a
- https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/
    

## Get industry classification from [ICB](https://en.wikipedia.org/wiki/Industry_Classification_Benchmark)

Just an example of using [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/) to scrape web pages:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Page URL
link = 'https://en.wikipedia.org/wiki/Industry_Classification_Benchmark'

# Read the web page
html = urlopen(link).read()
# Parse it with BeautifulSoup (the "lxml" parser must be installed)
soup = BeautifulSoup(html, "lxml")
table = soup.find_all("table")
# Every cell of the first table is expected to hold "<code> <label>"
td = table[0]('td')

# One dictionary per level of the hierarchy, plus one holding every code
dict_industry = {}
dict_supersector = {}
dict_sector = {}
dict_subsector = {}
dict_all = {}

for item in td:
    parts = item.text.replace('\n', '').split(' ')
    code = int(parts[0])
    content = " ".join(parts[1:])
    dict_all[code] = content
    # Modulo sorts out the hierarchy among the codes
    if code % 1000 == 0:
        dict_industry[code] = content
    elif code % 100 == 0:
        dict_supersector[code] = content
    elif code % 10 == 0:
        dict_sector[code] = content
    elif code != 1:
        # Code 1 is not a sub-sector; it stays available in dict_all only
        dict_subsector[code] = content

In [2]:
dict_industry

{1000: 'Basic Materials',
 2000: 'Industrials',
 3000: 'Consumer Goods',
 4000: 'Health Care',
 5000: 'Consumer Services',
 6000: 'Telecommunications',
 7000: 'Utilities',
 8000: 'Financials',
 9000: 'Technology'}

In [3]:
dict_supersector

{500: 'Oil & Gas',
 1300: 'Chemicals',
 1700: 'Basic Resources',
 2300: 'Construction & Materials',
 2700: 'Industrial Goods & Services',
 3300: 'Automobiles & Parts',
 3500: 'Food & Beverage',
 3700: 'Personal & Household Goods',
 4500: 'Health Care',
 5300: 'Retail',
 5500: 'Media',
 5700: 'Travel & Leisure',
 6500: 'Telecommunications',
 7500: 'Utilities',
 8300: 'Banks',
 8500: 'Insurance',
 8600: 'Real Estate',
 8700: 'Financial Services',
 9500: 'Technology'}

In [4]:
dict_sector

{530: 'Oil & Gas Producers',
 570: 'Oil Equipment, Services & Distribution',
 580: 'Alternative Energy',
 1350: 'Chemicals',
 1730: 'Forestry & Paper',
 1750: 'Industrial Metals & Mining',
 1770: 'Mining',
 2350: 'Construction & Materials',
 2710: 'Aerospace & Defense',
 2720: 'General Industrials',
 2730: 'Electronic & Electrical Equipment',
 2750: 'Industrial Engineering',
 2770: 'Industrial Transportation',
 2790: 'Support Services',
 3350: 'Automobiles & Parts',
 3530: 'Beverages',
 3570: 'Food Producers',
 3720: 'Household Goods & Home Construction',
 3740: 'Leisure Goods',
 3760: 'Personal Goods',
 3780: 'Tobacco',
 4530: 'Health Care Equipment & Services',
 4570: 'Pharmaceuticals & Biotechnology',
 5330: 'Food & Drug Retailers',
 5370: 'General Retailers',
 5550: 'Media',
 5750: 'Travel & Leisure',
 6530: 'Fixed Line Telecommunications',
 6570: 'Mobile Telecommunications',
 7530: 'Electricity',
 7570: 'Gas, Water & Multiutilities',
 8350: 'Banks',
 8530: 'Nonlife Insurance',
 85

In [5]:
dict_subsector

{533: 'Exploration & Production',
 537: 'Integrated Oil & Gas',
 573: 'Oil Equipment & Services',
 577: 'Pipelines',
 583: 'Renewable Energy Equipment',
 587: 'Alternative Fuels',
 1353: 'Commodity Chemicals',
 1357: 'Specialty Chemicals',
 1733: 'Forestry',
 1737: 'Paper',
 1753: 'Aluminum',
 1755: 'Nonferrous Metals',
 1757: 'Iron & Steel',
 1771: 'Coal',
 1773: 'Diamonds & Gemstones',
 1775: 'General Mining',
 1777: 'Gold Mining',
 1779: 'Platinum & Precious Metals',
 2353: 'Building Materials & Fixtures',
 2357: 'Heavy Construction',
 2713: 'Aerospace',
 2717: 'Defense',
 2723: 'Containers & Packaging',
 2727: 'Diversified Industrials',
 2733: 'Electrical Components & Equipment',
 2737: 'Electronic Equipment',
 2753: 'Commercial Vehicles & Trucks',
 2757: 'Industrial Machinery',
 2771: 'Delivery Services',
 2773: 'Marine Transportation',
 2775: 'Railroads',
 2777: 'Transportation Services',
 2779: 'Trucking',
 2791: 'Business Support Services',
 2793: 'Business Training & Employmen
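
The same modulo logic can be turned around to walk up the hierarchy from any code, which may help when labelling a company at several levels at once. This is a sketch under the assumption that the four-digit scheme visible in the outputs above holds for every code; the helper name `icb_parents` is hypothetical, and the fallback to code 1 assumes that the Oil & Gas industry is the entry the scraping loop special-cases with `!= 1`.

```python
def icb_parents(code):
    """Return (industry, supersector, sector) codes for an ICB code.

    Assumes the four-digit layout shown in the dictionaries above.
    """
    sector = code // 10 * 10
    supersector = code // 100 * 100
    # Codes below 1000 would give 0 here; fall back to code 1 (assumption).
    industry = code // 1000 * 1000 or 1
    return industry, supersector, sector


# Excerpts copied from the outputs above.
names = {
    2000: "Industrials",
    2700: "Industrial Goods & Services",
    2710: "Aerospace & Defense",
    2713: "Aerospace",
}

industry, supersector, sector = icb_parents(2713)
print(names[industry], ">", names[supersector], ">", names[sector], ">", names[2713])
# -> Industrials > Industrial Goods & Services > Aerospace & Defense > Aerospace
```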

## Enough for me, your turn to play...