**ASSIGNMENT 4 - Text Classification**

1. Group number: 22
   Member names: Salim Rholam
   Member student numbers: 300205835
   Title: \\

**Derived datasets**

This notebook is a starting point for Assignment 4. In this assignment, you will carry out an empirical classification study. This notebook will help you create the derived datasets for Section 2 of the assignment.

In [None]:
#let's start by installing spaCy
!pip install spacy

In [2]:
import spacy
import pandas as pd
import numpy as np

You were given a list of datasets in the assignment description. Choose one of them, provide the link below, and read it with pandas. You must provide a link to your own GitHub repository even if you use a reduced version of a dataset from your TA's repository.

You will need to add a description of the dataset and justify the choices you made to obtain the derived datasets.

In [3]:
#Load the dataset you chose.
# Make sure the Notebook can load your dataset, just like previous assignments.

# url = 'https://raw.githubusercontent.com/baharin/CSI4106-Assignment4-Datasets/main/reduced_file_cnnnews.csv'
# url = 'https://raw.githubusercontent.com/baharin/CSI4106-Assignment4-Datasets/main/reduced_drugsComTest_raw_fiveclasses.csv'
url = 'https://raw.githubusercontent.com/RhSalim/dataset/22ba73653058f1df52ac156c6d0831349ead0034/reduced_file_cnnnews.csv'

#provide the link to the raw version of dataset. You *need* to provide a link to *your own* github repository. DO NOT use the link that is provided as an example.



In [None]:
print(url)
data = pd.read_csv(url)

In [None]:
data.head()
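
A dataset description can start from a few summary statistics. The sketch below is only an illustration on a toy DataFrame: the column names 'Body' and 'Class' are the ones used later in this notebook, while the rows and class labels are invented for the example.

```python
import pandas as pd

# Toy stand-in for the real dataset; only the column names come from the notebook
toy = pd.DataFrame({
    "Body": ["Stocks rose sharply today.", "The team won the final.", "Markets fell again."],
    "Class": ["business", "sport", "business"],
})

# Number of documents per class (shows whether the classes are balanced)
class_counts = toy["Class"].value_counts()

# Number of whitespace-separated words per document
body_lengths = toy["Body"].str.split().str.len()

print(class_counts)
print(body_lengths.describe())
```

On the real data, replacing `toy` with `data` should give the class balance and text lengths to report in the description.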

Here you create your NLP pipeline. The `spacy download` command fetches the English model, and `spacy.load()` then loads it for analysis.

In [6]:
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Applying the pipeline to each text creates a document (Doc) in which each word is a Token object.

Doc: https://spacy.io/api/doc

Token: https://spacy.io/api/token

In [None]:
#Apply nlp pipeline to the column that has your sentences (the text that will serve as input features).
data['tokenized'] = data['Body'].apply(lambda text: nlp(text))

In [None]:
data.head()
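
On larger datasets, calling the pipeline once per row can be slow. A possible alternative is spaCy's `nlp.pipe`, which processes texts in batches. The sketch below uses `spacy.blank("en")`, a tokenizer-only pipeline, purely as a stand-in so that it runs without downloading a model; the variable names and the batch size are assumptions made for the example.

```python
import spacy

# Tokenizer-only stand-in so the sketch runs without a model download;
# in this notebook you would keep nlp = spacy.load("en_core_web_sm")
nlp_demo = spacy.blank("en")

texts = ["Apple is looking at buying a startup.", "Markets fell again."]

# nlp.pipe streams the texts through the pipeline in batches and yields Doc objects
docs = list(nlp_demo.pipe(texts, batch_size=50))

token_counts = [len(doc) for doc in docs]
print(token_counts)
```

With the real pipeline, the equivalent assignment would be along the lines of `data['tokenized'] = list(nlp.pipe(data['Body']))`.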

A token has several attributes, such as part of speech (pos_), lemma (lemma_), etc. See the documentation for the full list of attributes.

The following function is an example of how you can retrieve specific parts of speech (POS) from a sentence. We return the lemma because we only want the base form of each word (the infinitive, for verbs).

In [None]:
#create empty dataframes that will store your derived datasets

derived_dataset1 = pd.DataFrame(columns = ['Class', 'pos'])
derived_dataset2 = pd.DataFrame(columns = ['Class', 'pos-np'])

In [9]:
def get_pos(sentence, wanted_pos): # wanted_pos lists the desired POS tags, e.g. ['VERB']
    lemmas = []
    for token in sentence:
        if token.pos_ in wanted_pos:
            lemmas.append(token.lemma_) # lemma is an integer ID; lemma_ is the string form
    return ' '.join(lemmas) # return a string, not a list, so CountVectorizer can consume it
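
To see what this kind of filter returns without running the full model, the sketch below builds a spaCy Doc by hand. The POS tags and lemmas are typed in manually, so they are assumptions made for the illustration and not output from `en_core_web_sm`.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def get_pos(sentence, wanted_pos):
    """Return the lemmas of the tokens whose coarse POS tag is in wanted_pos."""
    return ' '.join(token.lemma_ for token in sentence if token.pos_ in wanted_pos)

# Hand-annotated example sentence (annotations are assumed, not predicted)
words = ["Apple", "is", "looking", "at", "buying", "a", "startup"]
pos = ["PROPN", "AUX", "VERB", "ADP", "VERB", "DET", "NOUN"]
lemmas = ["Apple", "be", "look", "at", "buy", "a", "startup"]
doc = Doc(Vocab(), words=words, pos=pos, lemmas=lemmas)

verbs = get_pos(doc, ["VERB"])
nouns = get_pos(doc, ["NOUN", "PROPN"])
print(verbs)  # look buy
print(nouns)  # Apple startup
```

Note that the auxiliary "is" is tagged AUX, not VERB, so it is left out unless 'AUX' is added to the list.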

In [None]:
#As an example, we use the above function to fetch all the verbs. We store this information in our first derived dataset
derived_dataset1['pos'] = data['tokenized'].apply(lambda sent : get_pos(sent, ['VERB']))

In [None]:
derived_dataset1.head()

In [None]:
#Change this line to fetch your desired POS tags for the second derived dataset
#spaCy's coarse-grained tag for adjectives is 'ADJ'; 'ADJECTIVE' matches no token
derived_dataset2['pos-np'] = data['tokenized'].apply(lambda sent : get_pos(sent, ['ADJ']))

In [None]:
#For Derived Dataset 2, you also need to include Named Entities
#Below is just an example of obtaining such entities on a specific sentence, but you would do NER
#on the dataset of your choice.
#You can choose the types of entities (dates, organization, people) that you want,
#and then in your derived dataset, just make sure you include these entities separated by spaces (as shown for verbs)
#in a previous cell.

sentence = "apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
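
One possible way to combine POS lemmas and named entities into a single space-separated string for the second derived dataset is sketched below. The helper `get_pos_and_entities` is hypothetical (it is not part of the starter code), and the Doc is annotated by hand so that the example runs without a model; the choice of entity labels is also only an example.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def get_pos_and_entities(doc, wanted_pos, wanted_labels):
    """Hypothetical helper: lemmas of the wanted POS tags followed by wanted entity texts."""
    lemmas = [token.lemma_ for token in doc if token.pos_ in wanted_pos]
    entities = [ent.text for ent in doc.ents if ent.label_ in wanted_labels]
    return ' '.join(lemmas + entities)

# Hand-annotated sentence; the tags and entity labels are assumed, not predicted
words = ["Apple", "bought", "a", "small", "startup", "in", "London"]
pos = ["PROPN", "VERB", "DET", "ADJ", "NOUN", "ADP", "PROPN"]
lemmas = ["Apple", "buy", "a", "small", "startup", "in", "London"]
ents = ["B-ORG", "O", "O", "O", "O", "O", "B-GPE"]
doc = Doc(Vocab(), words=words, pos=pos, lemmas=lemmas, ents=ents)

combined = get_pos_and_entities(doc, ["ADJ"], ["ORG", "GPE"])
print(combined)  # small Apple London
```

On the real data, something like `data['tokenized'].apply(lambda d: get_pos_and_entities(d, ['ADJ'], ['ORG', 'GPE']))` could fill the 'pos-np' column.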

Now that you have your derived datasets, you can carry out your classification task.

**Classification with MLP and Logistic Regression**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# Assuming 'Class' column in the original dataset represents the target variable
X = derived_dataset1['pos']  # Input features
y = data['Class']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to numerical format using CountVectorizer or TF-IDF Vectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Initialize and train the MLP classifier
mlp_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
mlp_classifier.fit(X_train_vectorized, y_train)

# Make predictions on the test set
predictions = mlp_classifier.predict(X_test_vectorized)

# Evaluate the performance
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")

# Display additional classification metrics
print("\nClassification Report:")
print(classification_report(y_test, predictions))


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# The target variable is the 'Class' column of the original dataset
y = data['Class']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(derived_dataset1['pos'], y, test_size=0.2, random_state=42)

# LogisticRegression needs numeric features, so convert the lemma strings to count vectors first
vectorizer1 = CountVectorizer()
X_train_vectorized = vectorizer1.fit_transform(X_train)
X_test_vectorized = vectorizer1.transform(X_test)

# Initialize and train the logistic regression model for derived dataset 1
logistic_regression_model1 = LogisticRegression()
logistic_regression_model1.fit(X_train_vectorized, y_train)

# Make predictions on the test set
predictions1 = logistic_regression_model1.predict(X_test_vectorized)

# Calculate and print accuracy
accuracy1 = accuracy_score(y_test, predictions1)
print(f"Accuracy for derived dataset 1: {accuracy1}")

# Repeat the process for derived dataset 2, with its own vectorizer
X_train, X_test, y_train, y_test = train_test_split(derived_dataset2['pos-np'], y, test_size=0.2, random_state=42)

vectorizer2 = CountVectorizer()
X_train_vectorized = vectorizer2.fit_transform(X_train)
X_test_vectorized = vectorizer2.transform(X_test)

logistic_regression_model2 = LogisticRegression()
logistic_regression_model2.fit(X_train_vectorized, y_train)

predictions2 = logistic_regression_model2.predict(X_test_vectorized)

accuracy2 = accuracy_score(y_test, predictions2)
print(f"Accuracy for derived dataset 2: {accuracy2}")
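
As a possible variation, the vectorizer and the classifier can be chained in a scikit-learn pipeline, here with the TF-IDF vectorizer mentioned earlier as an alternative to raw counts. This is only a sketch on invented toy data; the texts and class labels are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy lemma strings and labels, invented for the illustration
train_texts = ["buy sell trade", "sell trade invest", "run score win", "win play score"]
train_labels = ["business", "business", "sport", "sport"]

# The pipeline fits the vectorizer on the training texts, then trains the classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Raw strings can be passed directly; the pipeline vectorizes them before predicting
predictions = list(model.predict(["trade invest buy", "play win"]))
print(predictions)
```

Keeping both steps in one object makes it harder to accidentally fit the vectorizer on the test split.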
