# [Project 6: Automatically categorize questions](https://openclassrooms.com/fr/projects/categorisez-automatiquement-des-questions)
(Data Scientist path: [here](https://openclassrooms.com/paths/63-data-scientist))

Data exporter (StackExchange): [here](https://data.stackexchange.com/stackoverflow/query/new).  
My minimal SQL query:
```sql
SELECT
   Id,Body,Title,Tags
FROM
   Posts
WHERE
   Id < 5000 and Body<>'' and Title<>'' and Tags <>''
```

### Imports

In [1]:
import os
HOME = os.path.expanduser('~/')
HOST = os.uname()[1]
if HOST == 'Arthurs-MacBook-Pro.local':
    os.chdir(HOME+'Documents/GitHub/OCDataSciencePath/Project6/')    # @home
else:
    raise ValueError('unknown host: {}'.format(HOST))
    
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

Background and usage examples for [BeautifulSoup](https://pypi.org/project/beautifulsoup4/): [here](https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python).

In [2]:
from bs4 import BeautifulSoup # conda install beautifulsoup4

Background and usage examples for [NLTK](https://pypi.org/project/nltk/): [here](http://www.nltk.org/book/).

In [3]:
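import nltk
#nltk.download('stopwords')  # run once if the stopword corpus is not installed
#nltk.download('punkt')
stopwords = set(nltk.corpus.stopwords.words('english'))  # set: fast membership tests
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')        # keeps runs of word characters, drops punctuation
stemer = nltk.PorterStemmer()                            # TODO: factorise this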

### Data

In [4]:
if HOST == 'Arthurs-MacBook-Pro.local':
    pathToDataDir = HOME+'Documents/Dropbox/Transit/OCDataScienceData/Project6/'    # @home
else:
    raise ValueError('unknown host: {}'.format(HOST))
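
The host check above ties the notebook to one machine. A minimal sketch of a more portable lookup follows, assuming the data directory is passed through an environment variable; the helper `resolve_data_dir` and the variable name `OC_PROJECT6_DATA` are hypothetical and do not appear in the notebook.

```python
import os

def resolve_data_dir(env_var='OC_PROJECT6_DATA', default=None):
    """Return the data directory from an environment variable, else a default.

    Both `env_var` and this helper are illustrative names (assumptions).
    """
    path = os.environ.get(env_var, default)
    if path is None:
        raise ValueError('set {} or pass a default directory'.format(env_var))
    # Expand '~' and guarantee a trailing separator so `path + filename` still works.
    return os.path.join(os.path.expanduser(path), '')
```

With this in place, `pathToDataDir = resolve_data_dir(default='~/Documents/Dropbox/Transit/OCDataScienceData/Project6/')` would keep the current behaviour at home while letting other machines override the location.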

In [5]:
filename = 'QueryResults_light.csv'
df = pd.read_csv(pathToDataDir+filename,index_col='Id')

In [6]:
df.sample(5)

| Id | Body | Title | Tags |
|---|---|---|---|
| 3927 | `<p>What profilers have you used when working w...` | What Are Some Good .NET Profilers? | `<c#><.net><profiling><profiler>` |
| 146 | `<p>I have a website that plays mp3s in a flash...` | How do I track file downloads | `<php><apache><logging><download><analytics>` |
| 2780 | `<p>Let's say that we have an ARGB color:</p>\n...` | Converting ARBG to RGB with alpha blending | `<c#><colors>` |
| 3667 | `<p>We are currently using a somewhat complicat...` | What is your favorite web app deployment workf... | `<svn><deployment>` |
| 871 | `<p>I've been using <a href="http://en.wikipedi...` | Why is Git better than Subversion? | `<svn><git>` |


### Cleaning

In [7]:
def basicHTMLTextCleaner(htmlText,tokenizer,stopwords,stemer):
    '''
    Basic cleaner and tokenizer for raw HTML text.
    
    Inputs
    ------
    htmlText : str
        raw HTML text.
    tokenizer : nltk.tokenize.regexp.RegexpTokenizer or similar
        used to split the text into tokens.
    stopwords : collection of str (e.g. the output of nltk.corpus.stopwords.words)
        tokens to discard.
    stemer : nltk.stem.porter.PorterStemmer or similar
        used to stem the remaining tokens.
        
    Returns
    -------
    text : nltk.Text
        the cleaned tokens, as an nltk.Text object.
    '''
    soup = BeautifulSoup(htmlText, 'html.parser')           # explicit parser avoids the "no parser specified" warning
    raw = soup.get_text(strip=False).lower()                # text from html + lower
    tokens = tokenizer.tokenize(raw)                        # tokenization
    tokens_stop = [t for t in tokens if t not in stopwords] # stopwords
    tokens_stp_stem = [stemer.stem(t) for t in tokens_stop] # stemming

    text = nltk.Text(tokens_stp_stem)
    return text
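
To make the four steps of the cleaner concrete (strip HTML, lowercase, tokenize, drop stopwords, then stem), here is a rough standard-library sketch of the same pipeline. It is not the notebook's implementation: `HTMLParser` stands in for BeautifulSoup, `re.findall(r'\w+', ...)` plays the role of the `RegexpTokenizer`, the stopword set is a tiny illustrative one, and stemming is left as a pluggable callable that defaults to the identity. All names in it are assumptions.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment (stand-in for soup.get_text())."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_html_tokens(html_text, stopwords, stem=lambda t: t):
    """Return the cleaned tokens of `html_text` as a plain list of str."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()                                    # flush any buffered text
    raw = ''.join(parser.chunks).lower()              # text from html + lower
    tokens = re.findall(r'\w+', raw)                  # tokenization
    kept = [t for t in tokens if t not in stopwords]  # stopword removal
    return [stem(t) for t in kept]                    # stemming (identity by default)

demo_stopwords = {'how', 'do', 'i', 'a', 'the', 'in'}  # illustrative only, not NLTK's list
print(clean_html_tokens('<p>How do I disable <code>autocomplete</code> in a form?</p>',
                        demo_stopwords))
# ['disable', 'autocomplete', 'form']
```

Passing `stem=nltk.PorterStemmer().stem` and NLTK's English stopwords should bring the result close to what `basicHTMLTextCleaner` produces, though the two HTML parsers may differ on malformed markup.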

In [8]:
def basicTagTextCleaner(tagText):
    '''
    Basic tokenizer for tag text.
    
    Inputs
    ------
    tagText : str
        tags as a single string, e.g. '<svn><git>'.
        
    Returns
    -------
    text : nltk.Text
        the tag tokens, as an nltk.Text object.
    '''
    tokens = tagText.split()    # splits on whitespace only: '<svn><git>' stays a single token
    text = nltk.Text(tokens)
    return text
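
Since the exported tags are written as `<tag1><tag2>...` with no whitespace between them, `str.split()` returns the whole string as one token. If one token per tag is wanted, a regular expression is one possible approach. The sketch below is an alternative, not the notebook's method, and `split_tags` is a hypothetical helper name.

```python
import re

def split_tags(tag_text):
    """Split a tag string such as '<c#><.net>' into individual tag names."""
    # Capture everything between each pair of angle brackets.
    return re.findall(r'<([^<>]+)>', tag_text)

print(split_tags('<c#><.net><profiling><profiler>'))
# ['c#', '.net', 'profiling', 'profiler']
print('<svn><git>'.split())
# ['<svn><git>']
```

Wrapping the result in `nltk.Text(...)` would keep the return type of `basicTagTextCleaner` unchanged.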

In [9]:
for c in ('Body','Title'):
    df[c+'_clean'] = df[c].apply(lambda x: basicHTMLTextCleaner(x,tokenizer,stopwords,stemer))
for c in ('Tags',):
    df[c+'_clean'] = df[c].apply(lambda x: basicTagTextCleaner(x))

In [10]:
df.sample(5)

| Id | Body | Title | Tags | Body_clean | Title_clean | Tags_clean |
|---|---|---|---|---|---|---|
| 2530 | `<p>How do you disable <code>autocomplete</code...` | How do you disable browser Autocomplete on web... | `<html><browser><autocomplete>` | (disabl, autocomplet, major, browser, specif, ... | (disabl, browser, autocomplet, web, form, fiel... | `(<html><browser><autocomplete>)` |
| 4986 | `<p>What is a good way to get people to alpha t...` | Web App Beta | `<web-applications>` | (good, way, get, peopl, alpha, test, web, appl... | (web, app, beta) | `(<web-applications>)` |
| 1229 | `<p>I want to link to a specific slide in an on...` | How do I hyperlink to a specific slide of a .p... | `<hyperlink><powerpoint>` | (want, link, specif, slide, onlin, powerpoint,... | (hyperlink, specif, slide, ppt, file) | `(<hyperlink><powerpoint>)` |
| 2256 | `<p>Is there a way of mapping data collected on...` | Mapping Stream data to data structures in C# | `<c#><c++><data-structures>` | (way, map, data, collect, stream, array, data,... | (map, stream, data, data, structur, c) | `(<c#><c++><data-structures>)` |
| 651 | `<p>I've been having trouble getting my ASP.NET...` | Checklist for IIS 6/ASP.NET Windows Authentica... | `<asp.net><iis><authentication><active-directory>` | (troubl, get, asp, net, applic, automat, log, ... | (checklist, ii, 6, asp, net, window, authent) | `(<asp.net><iis><authentication><active-directo...` |


### Cleaned data

In [11]:
name,ext = os.path.splitext(filename)

c = ['Body_clean','Title_clean','Tags_clean']    # list of column labels to export
df.loc[:,c].to_csv(os.path.join(pathToDataDir,name+'_clean'+ext),index=True)
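
One caveat on the export: a CSV cell holding an object such as `nltk.Text` stores whatever `str()` returns for it, which is not guaranteed to contain every token. A cautious option is to convert each token sequence to a plain string before saving and to split it again after loading. The sketch below assumes tokens never contain the separator (true for `\w+` tokens and for tag names without spaces); `tokens_to_field` and `field_to_tokens` are hypothetical helper names.

```python
def tokens_to_field(tokens, sep=' '):
    """Join a token sequence into one string suitable for a CSV cell."""
    return sep.join(tokens)

def field_to_tokens(field, sep=' '):
    """Inverse of tokens_to_field; an empty cell gives an empty token list."""
    return field.split(sep) if field else []

tokens = ['map', 'stream', 'data', 'data', 'structur', 'c']
field = tokens_to_field(tokens)
print(field)
# map stream data data structur c
print(field_to_tokens(field) == tokens)
# True
```

In the notebook this could be applied with `df[col].apply(tokens_to_field)` on each cleaned column just before `to_csv`.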