
## **Multi-class Text Categorization**

**Objective**. In this technical test, you are asked to solve a multi-class classification problem on
textual data using machine learning, a task commonly known as text categorization. The test
has two parts: model experimentation and deployment. Each part is described in more detail
in its corresponding section.

###### Load the data

The dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training and one for testing.

In [1]:
!pip install googledrivedownloader



In [2]:
import os
import nltk
import random
import numpy as np
import pandas as pd

from spacy.tokenizer import Tokenizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [3]:
from google_drive_downloader import GoogleDriveDownloader  as gdd

In [4]:
gdd.download_file_from_google_drive(file_id ='1ywHLd78-Ms5SmyEuHGmJDDGHSGsvvcD2', dest_path ='./dataset.zip', unzip=True)

Downloading 1ywHLd78-Ms5SmyEuHGmJDDGHSGsvvcD2 into ./dataset.zip... Done.
Unzipping...Done.


In [5]:
data_dir_test = '/content/test'
data_dir_train = '/content/train'
df_test = os.listdir(data_dir_test)
df_train = os.listdir(data_dir_train)

In [6]:
from os import listdir
from os.path import isfile, join
files = []
for folder_name in df_test:
    folder_path = join(data_dir_test, folder_name)
    files.append([f for f in listdir(folder_path)])

In [7]:
sum(len(files[i]) for i in range(20))

7532

In [8]:
df_test

['soc.religion.christian',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'alt.atheism',
 'talk.politics.mideast',
 'rec.sport.hockey',
 'talk.politics.misc',
 'sci.space',
 'comp.windows.x',
 'comp.sys.ibm.pc.hardware',
 'sci.electronics',
 'sci.med',
 'sci.crypt',
 'misc.forsale',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.autos',
 'comp.sys.mac.hardware',
 'talk.religion.misc',
 'talk.politics.guns']

In [9]:
df_train

['soc.religion.christian',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'alt.atheism',
 'talk.politics.mideast',
 'rec.sport.hockey',
 'talk.politics.misc',
 'sci.space',
 'comp.windows.x',
 'comp.sys.ibm.pc.hardware',
 'sci.electronics',
 'sci.med',
 'sci.crypt',
 'misc.forsale',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.autos',
 'comp.sys.mac.hardware',
 'talk.religion.misc',
 'talk.politics.guns']

In [10]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [11]:
from pprint import pprint
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [12]:
newsgroups_train.filenames.shape

(11314,)

In [13]:
newsgroups_train.target.shape

(11314,)

For preprocessing, I start with a bag-of-words or **tf-idf** model to see how well it can separate the different categories.
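To make the tf-idf weighting concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the smoothed idf with L2 normalization that scikit-learn applies by default; the `tfidf` helper and the three toy documents are illustrative and not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the rocket reached orbit",
    "the image renders the scene",
    "the orbit of the moon",
]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "the" appears in every document and therefore gets a lower weight than "rocket", which appears in only one.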

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(2034, 34118)

In [15]:
vectors.nnz / float(vectors.shape[0])

159.0132743362832

The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero components per sample in a space of more than 30,000 dimensions (less than 0.5% non-zero features).
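The sparsity figure can be checked with a short calculation. The sketch below reuses the numbers printed by the two cells above (matrix shape and average non-zero count); the variable names are illustrative.

```python
# Values taken from the outputs of the cells above.
n_samples, n_features = 2034, 34118
avg_nonzero = 159.0132743362832  # vectors.nnz / vectors.shape[0]

# Fraction of non-zero entries in an average row of the TF-IDF matrix.
density = avg_nonzero / n_features
print(f"{density:.2%} of the features are non-zero on average")
```

The result is a little under 0.5%, which is why the matrix is stored in a sparse format rather than as a dense array.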

In [16]:
#importing dataset from sklearn
from sklearn.datasets import fetch_20newsgroups
#importing train and test dataset
train_df= fetch_20newsgroups(subset="train" ,categories = categories) 
test_df= fetch_20newsgroups(subset="test" ,categories = categories)

In [17]:
X_train = train_df["data"]
X_test=test_df['data']
y_train = train_df["target"] 
y_test=test_df['target']

In [18]:
df=pd.DataFrame(X_train,columns=['mess'])

In [19]:
#adding a target column
df['target']=y_train

In [20]:
#making length a feature for visualizations
df['length']=df['mess'].apply(len)
df.head()

| | mess | target | length |
|---|---|---|---|
| 0 | From: rych@festival.ed.ac.uk (R Hawkes)\nSubje... | 1 | 1022 |
| 1 | Subject: Re: Biblical Backing of Koresh's 3-02... | 3 | 1117 |
| 2 | From: Mark.Perew@p201.f208.n103.z1.fidonet.org... | 2 | 572 |
| 3 | From: dpw@sei.cmu.edu (David Wood)\nSubject: R... | 0 | 1454 |
| 4 | From: prb@access.digex.com (Pat)\nSubject: Con... | 2 | 449 |


# Text Pre-processing

In [21]:
#importing string for punctuations
import string
import nltk
#now we import most common words i.e. stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [22]:
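#making a function to process our data
# Build the stopword set once; calling stopwords.words() for every word is very slow
STOP_WORDS = set(stopwords.words('english'))

def text_process(mess):
    # Drop punctuation characters and rebuild the string
    no_punc = ''.join(c for c in mess if c not in string.punctuation)
    # Keep only the words that are not English stopwords
    cleaned_mess = [word for word in no_punc.split() if word.lower() not in STOP_WORDS]
    return cleaned_mess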

In [23]:
##applying our text_process function
#adding processed data to a new column
df['message']=df['mess'].apply(text_process)

In [24]:
import re

lexicon = (
    (re.compile(r"\bdon't\b"), "do not"),
    (re.compile(r"\bit's\b"), "it is"),
    (re.compile(r"\bi'm\b"), "i am"),
    (re.compile(r"\bi've\b"), "i have"),
    (re.compile(r"\bcan't\b"), "cannot"),
    (re.compile(r"\bdoesn't\b"), "does not"),
    (re.compile(r"\bthat's\b"), "that is"),
    (re.compile(r"\bdidn't\b"), "did not"),
    (re.compile(r"\bi'd\b"), "i would"),
    (re.compile(r"\byou're\b"), "you are"),
    (re.compile(r"\bisn't\b"), "is not"),
    (re.compile(r"\bi'll\b"), "i will"),
    (re.compile(r"\bthere's\b"), "there is"),
    (re.compile(r"\bwon't\b"), "will not"),
    (re.compile(r"\bhe's\b"), "he is"),
    (re.compile(r"\bthey're\b"), "they are"),
    (re.compile(r"\bwe're\b"), "we are"),
    (re.compile(r"\blet's\b"), "let us"),
    (re.compile(r"\bhaven't\b"), "have not"),
    (re.compile(r"\bwhat's\b"), "what is"),
    (re.compile(r"\baren't\b"), "are not"),
    (re.compile(r"\bwasn't\b"), "was not"),
    (re.compile(r"\bwouldn't\b"), "would not"),
)

def fix_apostrophes(text):
    text = text.lower()
    
    for pattern, replacement in lexicon:
        text = pattern.sub(replacement, text)

    return text

# Apply to the raw documents; iterating over the Bunch itself would only yield its keys
text_train = list(map(fix_apostrophes, X_train))
text_test = list(map(fix_apostrophes, X_test))
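As a quick illustration of what the contraction expansion does, here is a self-contained sketch with a two-entry lexicon. The names `demo_lexicon` and `expand_contractions` and the sample sentence are hypothetical; the logic mirrors `fix_apostrophes` above.

```python
import re

# Reduced lexicon for the demo; the notebook's full lexicon works the same way.
demo_lexicon = (
    (re.compile(r"\bdon't\b"), "do not"),
    (re.compile(r"\bit's\b"), "it is"),
)

def expand_contractions(text):
    # Lowercase first so the lowercase patterns also match "Don't" or "It's".
    text = text.lower()
    for pattern, replacement in demo_lexicon:
        text = pattern.sub(replacement, text)
    return text

result = expand_contractions("I don't think it's done")
print(result)
```

Note that lowercasing is a side effect of the function, so the output loses the original capitalization.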

# Normalization & Vectorization

In [25]:
#Importing CountVectorizer to convert a collection of text documents to a matrix of token counts.
from sklearn.feature_extraction.text import CountVectorizer

In [26]:
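# Fit on the raw text column: text_process expects a string. Fitting on the
# pre-tokenized df['message'] column collapses each document into a single token,
# which gives a vocabulary of 2034 entries, one per training document.
bow_transformer = CountVectorizer(analyzer=text_process).fit(df['mess'])
# Print total number of vocab words
print(len(bow_transformer.vocabulary_))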


# Classification

For classification, I use an algorithm such as a support vector machine (SVM).

To evaluate the model, you can use the F-beta score with macro or micro averaging.

In [27]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC()),
                     ])

text_clf.fit(X_train, y_train)


predicted = text_clf.predict(X_test)

print(metrics.classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.85      0.83      0.84       319
           1       0.92      0.97      0.94       389
           2       0.95      0.95      0.95       394
           3       0.81      0.76      0.79       251

    accuracy                           0.89      1353
   macro avg       0.88      0.88      0.88      1353
weighted avg       0.89      0.89      0.89      1353



In [28]:
from sklearn.metrics import fbeta_score

In [29]:
fbeta_score(y_test, predicted, average='macro', beta=0.5)

0.8823503546816361
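The macro-averaged F-beta score can be approximated by hand from the per-class precision and recall in the classification report. The sketch below assumes the two-decimal values printed in that report, so it only matches the `fbeta_score` result up to rounding; the `f_beta` helper is illustrative.

```python
def f_beta(precision, recall, beta):
    """F-beta: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# (precision, recall) per class, rounded values from the classification report.
per_class = [(0.85, 0.83), (0.92, 0.97), (0.95, 0.95), (0.81, 0.76)]

# Macro averaging: compute the score per class, then take the unweighted mean.
macro_f05 = sum(f_beta(p, r, beta=0.5) for p, r in per_class) / len(per_class)
print(macro_f05)
```

With `beta=0.5`, precision counts more than recall. Macro averaging gives every class the same weight regardless of its support, whereas micro averaging pools the counts over all samples before computing the score.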