### Testing the SGD model

I ran the English corpus, preprocessed the same way as for the SVM model, through the SGD model with the parameters set by Micha, to see how the results differ.

I tested with both TfidfVectorizer and CountVectorizer. The scores are very balanced, in the sense that there is almost no gap between the different metrics. In the end, I obtained the same results as Micha. My scores dropped with TfidfVectorizer. My explanation is that the corpus is not large enough for TF-IDF weighting, which accounts for the importance of each term, to be reliable. Combined with the SGD algorithm, which also needs a lot of data, this gives unsatisfactory results.
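
The comparison between the two vectorizers can be sketched as a single loop over scikit-learn pipelines. This is only a minimal sketch: the helper name `compare_vectorizers` and the toy sentences are assumptions made for illustration, not part of the original notebook, and the toy data says nothing about the scores on the real corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Return the weighted F1 score obtained with each vectorizer.

    Hypothetical helper, not part of the original notebook.
    """
    scores = {}
    candidates = [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]
    for name, vectorizer in candidates:
        pipeline = Pipeline([
            ("vectorizer", vectorizer),
            # Same classifier settings as the SGDClassifier cell of this notebook
            ("clf", SGDClassifier(max_iter=1000, loss="modified_huber", n_jobs=-1,
                                  learning_rate="optimal", random_state=42)),
        ])
        pipeline.fit(train_texts, train_labels)
        predictions = pipeline.predict(test_texts)
        scores[name] = f1_score(test_labels, predictions, average="weighted")
    return scores


# Toy data invented purely to show the call; it is not the notebook's corpus
toy_train = ["tax cuts for business", "lower tax and free market",
             "public health and welfare", "welfare and social housing"]
toy_train_labels = ["right", "right", "left", "left"]
toy_test = ["free market tax", "social welfare"]
toy_test_labels = ["right", "left"]

scores = compare_vectorizers(toy_train, toy_train_labels, toy_test, toy_test_labels)
print(scores)
```

With the real data, the call would presumably take the already preprocessed `train_texts`, `train_labels`, `test_texts`, and `test_labels`. Keeping the vectorizer inside the pipeline ensures it is fitted on the training texts only.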

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import xml.etree.ElementTree as ET
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.metrics import classification_report

In [4]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
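def preprocess(text):
    """Tokenize, lowercase, drop English stop words, and stem the tokens of `text`."""
    words = word_tokenize(text)
    # Lowercase first so the stop-word lookup is case-insensitive
    words = [word.lower() for word in words]
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Reduce each remaining token to its Porter stem
    stemmer = PorterStemmer()
    words = [stemmer.stem(word) for word in words]

    return ' '.join(words)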

In [6]:
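def parse_xml_train(file_path):
    """Return the preprocessed text and the party label of every <doc> in the training XML."""
    tree = ET.parse(file_path)
    root = tree.getroot()

    data = []
    labels = []

    for doc in root.findall('.//doc'):
        # The party label is stored in the 'valeur' attribute of the <PARTI> element
        party = doc.find('.//PARTI').attrib['valeur']
        labels.append(party)
        # Empty <p> elements have no text; treat them as empty strings
        paragraphs = [p.text.strip() if p.text is not None else '' for p in doc.findall('.//texte/p')]
        data.append(preprocess(' '.join(paragraphs)))

    return data, labels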
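
To make the expected input easier to picture, here is a small sketch of the XML layout that the lookups in `parse_xml_train` imply. The sample document, the `<corpus>` root, and the party names `PARTY_A` and `PARTY_B` are invented for illustration; the real files may nest `<PARTI>` more deeply, which the `.//PARTI` search would still find. Preprocessing is left out so the sketch runs without NLTK data.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-document sample. Only the names doc, PARTI, valeur, texte
# and p come from the parsing code; everything else is an assumption.
SAMPLE_XML = """
<corpus>
  <doc id="1">
    <PARTI valeur="PARTY_A"/>
    <texte><p>First paragraph.</p><p>Second paragraph.</p></texte>
  </doc>
  <doc id="2">
    <PARTI valeur="PARTY_B"/>
    <texte><p>Only paragraph.</p><p/></texte>
  </doc>
</corpus>
""".strip()

root = ET.fromstring(SAMPLE_XML)

raw_texts = []
labels = []
for doc in root.findall('.//doc'):
    labels.append(doc.find('.//PARTI').attrib['valeur'])
    # An empty <p/> has text None and contributes an empty string
    paragraphs = [p.text.strip() if p.text is not None else ''
                  for p in doc.findall('.//texte/p')]
    raw_texts.append(' '.join(paragraphs))

print(labels)
print(raw_texts)
```

Note that a `<doc>` without a `<PARTI>` element would make `doc.find` return `None` and the attribute lookup fail, so the training parser assumes every document is labelled.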

In [7]:
def load_text_file(text_file_path):
    """Read the reference file and return a {document ID: party label} dict."""
    with open(text_file_path, 'r') as file:
        lines = file.readlines()

    # Each valid line holds a document ID and a party label separated by a tab;
    # lines with any other number of fields are skipped
    party_id_mapping = {}
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) == 2:
            doc_id, label = parts
            party_id_mapping[int(doc_id)] = label

    return party_id_mapping
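
The parsing rule above can be checked on a small temporary file. This is a sketch under assumptions: the function name `read_reference`, the sample lines, and the party names are hypothetical, and the real reference file may contain other kinds of lines.

```python
import os
import tempfile


def read_reference(path):
    """Apply the same rule as load_text_file: keep lines with exactly two tab-separated fields."""
    mapping = {}
    with open(path, 'r') as handle:
        for line in handle:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                doc_id, label = parts
                mapping[int(doc_id)] = label
    return mapping


# Hypothetical content: two valid lines, one blank line, and one line
# that uses a space instead of a tab (skipped by the length check)
sample = "1\tPARTY_A\n2\tPARTY_B\n\n3 PARTY_A\n"

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

reference = read_reference(tmp_path)
os.remove(tmp_path)

print(reference)  # {1: 'PARTY_A', 2: 'PARTY_B'}
```

Malformed lines are dropped silently, so a document whose reference line is damaged simply disappears from the evaluation. A two-field line whose first field is not an integer would instead raise a `ValueError`.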

In [8]:
def parse_xml_test(file_path_xml, file_path_txt):
    """Return the preprocessed test texts and their reference party labels."""
    party_info = load_text_file(file_path_txt)

    doc_id_to_text = {}
    party_labels = []
    texts = []

    tree = ET.parse(file_path_xml)
    root = tree.getroot()

    for doc in root.findall('.//doc'):
        doc_id = doc.get('id')
        text_data = ' '.join([p.text if p.text is not None else '' for p in doc.findall('.//texte/p')])
        doc_id_to_text[int(doc_id)] = text_data

    # Keep only the documents that have both a text and a reference label;
    # sorting the IDs makes the output order reproducible
    common_keys = set(party_info.keys()) & set(doc_id_to_text.keys())
    for doc_id in sorted(common_keys):
        party_labels.append(party_info[doc_id])
        texts.append(preprocess(doc_id_to_text[doc_id]))

    return texts, party_labels

In [10]:
train_texts, train_labels = parse_xml_train('../../deft09_parlement_appr_en.xml')

test_texts, test_labels = parse_xml_test('../../deft09_parlement_test_en.xml', '../../deft09_parlement_ref_en.txt')

In [29]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

In [30]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_labels)
y_test = label_encoder.transform(test_labels)
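
As a brief illustration of what the encoder does, the sketch below fits a `LabelEncoder` on a hypothetical label list (the party names are invented) and maps the integer codes back to the original strings, which is useful when reading predictions.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for train_labels
example_labels = ["PARTY_B", "PARTY_A", "PARTY_C", "PARTY_A"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(example_labels)

# classes_ holds the distinct labels in sorted order;
# each integer code is the index of the label in that array
print(list(encoder.classes_))  # ['PARTY_A', 'PARTY_B', 'PARTY_C']
print(list(encoded))           # [1, 0, 2, 0]

# inverse_transform recovers the original strings from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the encoder is fitted on the training labels only, `transform` raises a `ValueError` if the test labels contain a party that never appears in training.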

In [31]:
clf = SGDClassifier(max_iter=1000, loss="modified_huber", n_jobs=-1, learning_rate="optimal", random_state=42)
clf.fit(X_train, y_train)

In [32]:
y_test_pred = clf.predict(X_test)

In [33]:
print("Accuracy:", accuracy_score(y_test, y_test_pred))
print("Precision:", precision_score(y_test, y_test_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_test_pred, average='weighted'))
print("F1 Score:", f1_score(y_test, y_test_pred, average='weighted'))

Accuracy: 0.7454123112659699
Precision: 0.7466279560368361
Recall: 0.7454123112659699
F1 Score: 0.7453506109356548
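
In the output above, Recall is identical to Accuracy down to the last digit. This is expected rather than a sign of balance: with `average='weighted'`, each class's recall is weighted by its support, and the sum reduces to the overall share of correct predictions. The sketch below checks this on hypothetical predictions (invented for illustration) and shows how a per-class breakdown could be obtained with `classification_report` and `confusion_matrix`.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, recall_score)

# Hypothetical true and predicted classes for three parties
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 2, 1]

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average='weighted')
macro_recall = recall_score(y_true, y_pred, average='macro')

print("Accuracy:", accuracy)
print("Weighted recall:", weighted_recall)  # equal to accuracy by construction
print("Macro recall:", macro_recall)        # unweighted mean of per-class recalls

# Per-class precision, recall, F1, and support
print(classification_report(y_true, y_pred))

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
```

Macro-averaged scores, or the per-class rows of the report, are probably a better indicator of whether the classifier treats all parties evenly.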


### CountVectorizer Results

<img src="../images/sgd countvec clean.png" width=350px height=180px />

### TfidfVectorizer Results

<img src="../images/sgd clean tfidf (2) param micha.png" width=350px height=180px />