# 1 - Data cleaning and exploratory analysis
---

**Imports and data loading**

In [1]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
import nltk

# Download text data sets, including stop words
nltk.download('stopwords')

# Load the posts dataset
df_posts = pd.read_csv('data/df_posts.csv')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adrie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df_posts.sample(10)

Unnamed: 0,Id,Body,Title,Tags
7977,70625,<p>How would you refactor these two classes to...,Refactoring two basic classes,<cultureinfo><regioninfo>
639,6484,"<p><a href=""http://www.google.com/search?q=how...",How do you mock a Sealed class?,<unit-testing><language-agnostic><tdd><mocking>
721,7477,<p>I'm currently working on an internal sales ...,How to autosize a textarea using Prototype?,<javascript><html><css><textarea><prototypejs>
2417,24954,<p>How to determine the applications associate...,Windows: List and Launch applications associat...,<.net><windows><registry>
1896,19442,<p>How can I create this file in a directory i...,how to allow files starting with period and no...,<mercurial><windows-server-2003><hgignore>
2203,22676,<p>I have a small utility that I use to downlo...,How to download a file over HTTP?,<python><http><urllib>
9126,81191,"<p>While answering <a href=""https://stackoverf...",PythonWin's python interactive shell calling c...,<python><activestate>
3645,35541,<p>Doug McCune had created something that was ...,Are there any good programs for actionscript/f...,<apache-flex><actionscript-3><code-analysis>
855,8761,<p>I'm updating some of our legacy C++ code to...,Find out which colours are in use when using t...,<colors><mfc-feature-pack>
4488,42354,<p>ObjectPal is the programming language used ...,Does anyone still use ObjectPal?,<paradox>


**Information on the dataframe columns**

In [3]:
def overview(dataframe):
    """Summarize each column: missing values, dtype and number of distinct values."""
    df_overview = pd.DataFrame()
    df_overview['column'] = list(dataframe.columns)
    # Count and share (in %) of missing values per column
    df_overview['qty_null_column'] = [dataframe[col].isna().sum() for col in dataframe.columns]
    df_overview['percent_null'] = df_overview['qty_null_column'] / dataframe.shape[0] * 100.0
    df_overview['dtype'] = list(dataframe.dtypes)
    # Number of distinct non-null values per column
    df_overview['qty_category_unique'] = [dataframe[col].nunique() for col in dataframe.columns]

    return df_overview

In [4]:
overview(df_posts)

Unnamed: 0,column,qty_null_column,percent_null,dtype,qty_category_unique
0,Id,0,0.0,int64,11149
1,Body,0,0.0,object,11148
2,Title,0,0.0,object,11148
3,Tags,0,0.0,object,9290
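
The same summary can also be built from vectorized pandas calls. The following is a minimal sketch, not the notebook's own helper: `overview_compact` and the toy dataframe are hypothetical names and values introduced for illustration.

```python
import pandas as pd


def overview_compact(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: missing values, dtype and number of distinct values."""
    return pd.DataFrame({
        "column": list(dataframe.columns),
        "qty_null_column": dataframe.isna().sum().to_numpy(),
        # isna().mean() is the fraction of missing values in each column
        "percent_null": dataframe.isna().mean().to_numpy() * 100.0,
        "dtype": list(dataframe.dtypes),
        # nunique() ignores missing values by default
        "qty_category_unique": dataframe.nunique().to_numpy(),
    })


# Hypothetical toy data, not taken from df_posts.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Title": ["a", "a", None, "b"],
})
summary = overview_compact(toy)
print(summary)
```

A distinct-value count lower than the row count in a column with no missing values, as for `Body` and `Title` above, points to duplicated entries.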


**Removing the few duplicates**

In [5]:
df_posts.drop_duplicates(subset=['Title'], inplace=True)
df_posts.drop_duplicates(subset=['Body'], inplace=True)
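
Before dropping rows, it can help to count how many would be removed. This is a small sketch on hypothetical toy data (the titles and bodies below are invented), using `DataFrame.duplicated` with the same `subset` arguments.

```python
import pandas as pd

# Hypothetical toy data, not taken from df_posts.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Title": ["How to parse XML?", "How to parse XML?", "Sorting a list", "Reading a file"],
    "Body": ["<p>first</p>", "<p>second</p>", "<p>third</p>", "<p>third</p>"],
})

# duplicated() flags every repeat after the first occurrence.
n_dup_title = int(toy.duplicated(subset=["Title"]).sum())
n_dup_body = int(toy.duplicated(subset=["Body"]).sum())

# Same two-step deduplication as the cell above, without inplace mutation.
deduped = toy.drop_duplicates(subset=["Title"]).drop_duplicates(subset=["Body"])
print(n_dup_title, n_dup_body, len(deduped))
```

Chaining the two calls returns a new dataframe and leaves the original untouched, which makes the cell safe to re-run.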

In [6]:
overview(df_posts)

Unnamed: 0,column,qty_null_column,percent_null,dtype,qty_category_unique
0,Id,0,0.0,int64,11147
1,Body,0,0.0,object,11147
2,Title,0,0.0,object,11147
3,Tags,0,0.0,object,9288


**Cleaning function: html_remove/letters_only/lower_case/stopwords_remove**

In [7]:
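from nltk.corpus import stopwords

# Build the stop word set once instead of on every call
set_stopwords = set(stopwords.words("english"))

def cleaning(text):
    # Strip HTML tags; naming the parser avoids bs4's "no parser specified" warning
    text = BeautifulSoup(text, "html.parser").get_text()
    # Keep letters only, then lower-case
    text = re.sub("[^a-zA-Z]", " ", text)
    text = text.lower()

    # Drop English stop words
    words = text.split()
    meaningful_words = [w for w in words if w not in set_stopwords]

    return " ".join(meaningful_words)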
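
To see what the four steps do to a single post, here is a standard-library sketch of the same pipeline. It is a stand-in, not the function above: `html.parser.HTMLParser` replaces BeautifulSoup, and `TOY_STOPWORDS` is a tiny invented list used in place of NLTK's English stop words.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop word list (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"i", "a", "the", "to", "in", "how", "do", "is"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, standing in for BeautifulSoup(...).get_text()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def cleaning_sketch(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    text = "".join(parser.chunks)          # 1. strip HTML tags
    text = re.sub("[^a-zA-Z]", " ", text)  # 2. keep letters only
    words = text.lower().split()           # 3. lower-case and tokenize
    return " ".join(w for w in words if w not in TOY_STOPWORDS)  # 4. drop stop words


sample = "<p>How do I parse <b>XML</b> in C# 3.0?</p>"
print(cleaning_sketch(sample))
```

The letters-only step also removes digits and symbols, so names such as `C#` or `ASP.NET` are reduced to `c` and `asp net`. This matches cleaned outputs like "asp net user control" in the samples.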

**Body**

In [8]:
df_posts['Body'] = df_posts['Body'].apply(cleaning)

In [9]:
df_posts['Body'].sample(5)

9479    xml schema part specifies instance datatype de...
8292    tools websites use read javadocs currently use...
925     asp net user control adding javascript window ...
1856    developing j application large amount data sto...
5317    n equal following function void foo char cvalu...
Name: Body, dtype: object

**Title**

In [10]:
df_posts['Title'] = df_posts['Title'].apply(cleaning)

In [11]:
df_posts['Title'].sample(5)

4871               sql file encoding visual studio
4371              generating javascript stubs wsdl
6571             show whole height referenced page
2546    windows service increasing cpu consumption
4874        ms sql fti searching n returns numbers
Name: Title, dtype: object

**Tags**

In [12]:
df_posts['Tags'] = df_posts['Tags'].apply(lambda x: x.replace('<',''))
df_posts['Tags'] = df_posts['Tags'].apply(lambda x: x.replace('>',' '))
df_posts['Tags'] = df_posts['Tags'].apply(lambda x: x.split())

In [13]:
df_posts['Tags'].sample(5)

4978         [iphone, xib, key-value-coding]
9473                     [css, class, xhtml]
2825    [open-source, software-distribution]
6145             [c#, asp.net, asp.net-ajax]
4038                [database, flash, adobe]
Name: Tags, dtype: object
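
The three `replace`/`split` passes can also be expressed as a single regular expression. This is a sketch under the assumption that every tag is wrapped in `<...>` as in the raw samples; `parse_tags` is a hypothetical helper name.

```python
import re


def parse_tags(raw):
    """Split a string like '<python><http>' into ['python', 'http']."""
    # Capture everything between each pair of angle brackets.
    return re.findall(r"<([^<>]+)>", raw)


# Raw tag strings copied from the first sample of df_posts.
print(parse_tags("<python><http><urllib>"))
print(parse_tags("<.net><windows><registry>"))
```

Applied to the raw column, this would read `df_posts['Tags'].apply(parse_tags)`. Unlike the cleaning function, it keeps characters such as `.`, `#` and `-` inside tag names.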

In [14]:
tags = [tag for row in df_posts['Tags'].values for tag in row]
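
One possible use of the flattened list is a tag frequency count. This is a minimal sketch with `collections.Counter`; the `rows` below are hypothetical and only mimic the list-of-tags shape of `df_posts['Tags']`.

```python
from collections import Counter

# Hypothetical rows in the same list-of-tags shape as df_posts['Tags'].
rows = [
    ["python", "http", "urllib"],
    ["python", "activestate"],
    ["c#", "asp.net", "asp.net-ajax"],
    ["css", "class", "xhtml"],
]

# Same flattening as the cell above.
tags = [tag for row in rows for tag in row]

tag_counts = Counter(tags)
print(len(tag_counts))            # number of distinct tags
print(tag_counts.most_common(1))  # most frequent tag with its count
```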

**Feature extraction**

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
posts_body_tfidf = tfidf.fit_transform(df_posts['Body'])

In [21]:
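# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (>= 1.0)
np.save("data/posts_body_feature_names.npy", tfidf.get_feature_names_out(), allow_pickle=True)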

In [17]:
np.save("data/posts_body_tfidf.npy", posts_body_tfidf, allow_pickle=True)
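
`fit_transform` returns a SciPy sparse matrix, so `np.save` stores it as a pickled object array, and reading it back requires `np.load(..., allow_pickle=True).item()`. One alternative, shown here as a sketch rather than as the notebook's method, is `scipy.sparse.save_npz`. The matrix values and the file name below are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for posts_body_tfidf (hypothetical values).
matrix = sparse.csr_matrix(np.array([[0.0, 0.6, 0.8], [1.0, 0.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tfidf.npz")  # hypothetical file name
    sparse.save_npz(path, matrix)     # stores the sparse structure without pickling
    restored = sparse.load_npz(path)  # round trip back to a sparse matrix

print(restored.shape)
```

This format keeps the matrix sparse on disk and avoids unpickling when loading, at the cost of a `.npz` extension instead of `.npy`.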