# Development template for the first midterm exam

**Guidelines:**
- This template is an EXAMPLE of how to organize the code for your exam
- You are free to ADD any methods and sections to the exam that you consider necessary
- Structure the work as methods, for example ReadInfo(), TrainModel(), etc.
- Keep the methods as clear and modular as possible
- Document every method with comments and docstrings
- You must use an ML model or an ensemble of them (SVC, DT, NB, KNN, etc.)
- Remember that you can split the data into training and validation sets
- You can consult the documentation of Sklearn, or of whichever library you choose, to understand the models' training parameters
- Deep Learning models (DNN, CNN, LSTM, etc.) are NOT allowed, NOR are embeddings

## Libraries

In [10]:
# Libraries used
import pandas as pd
import spacy
import nltk
import random
from textblob import TextBlob
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Download NLTK resources
nltk.download('wordnet')
nltk.download('stopwords')

nlp = spacy.load('en_core_web_lg')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mascenci/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mascenci/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Loading the Dataset

In [11]:
def load_dataset():
    """
    Load a dataset from a CSV file, remove rows with missing values, and filter out rows with empty titles.
    
    Returns:
        pd.DataFrame: A cleaned dataframe with rows containing valid titles and no missing values.
    
    Note:
        The CSV file 'DataSet para entrenamiento del modelo.csv' should be present in the current working directory.
    """
    df = pd.read_csv('DataSet para entrenamiento del modelo.csv').dropna()
    df = df[df['title'] != ""]
    print("Shape of the original dataset:", df.shape)
    return df


## Feature Engineering

In [12]:
def feature_engineering(df):
    """
    Perform feature engineering on the given dataframe based on the 'title' column.
    
    Parameters:
        df (pd.DataFrame): The input dataframe with a 'title' column.
    
    Returns:
        pd.DataFrame: The dataframe with new features added.
    
    Features added:
        - length: Length of the title.
        - unique_words: Number of unique words in the title.
        - numbers_count: Number of digits in the title.
        - exclamation_count: Number of exclamation marks in the title.
        - sentiment: Sentiment polarity of the title using TextBlob.
        - keyword_[keyword]: Count of specific keywords in the title.
        
    Note:
        Requires TextBlob library for sentiment analysis.
    """
    df['length'] = df['title'].apply(len)
    df['unique_words'] = df['title'].apply(lambda x: len(set(x.split())))
    df['numbers_count'] = df['title'].apply(lambda x: sum(c.isdigit() for c in x))
    df['exclamation_count'] = df['title'].apply(lambda x: x.count('!'))
    df['sentiment'] = df['title'].apply(lambda x: TextBlob(x).sentiment.polarity)
    keywords = ["top", "best", "first", "most", "amazing", "incredible"]
    for keyword in keywords:
        df[f'keyword_{keyword}'] = df['title'].apply(lambda x: x.split().count(keyword))
    return df
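
To make the feature definitions above concrete, here is a minimal pure-Python sketch that computes the same counts for a single title, without pandas or TextBlob (so `sentiment` is left out). The helper name `title_features` and the sample title are illustrative assumptions, not part of the notebook.

```python
KEYWORDS = ["top", "best", "first", "most", "amazing", "incredible"]

def title_features(title):
    """Compute the hand-crafted counts for one title (hypothetical helper)."""
    words = title.split()
    features = {
        "length": len(title),
        "unique_words": len(set(words)),
        "numbers_count": sum(c.isdigit() for c in title),
        "exclamation_count": title.count("!"),
    }
    for keyword in KEYWORDS:
        # Exact, case-sensitive token match, mirroring x.split().count(keyword)
        features[f"keyword_{keyword}"] = words.count(keyword)
    return features

sample = title_features("Top 10 best tricks you must try!!")
print(sample)
```

On this raw title `keyword_top` is 0, because "Top" is capitalized and the token match is case-sensitive; the keyword counts only fire on lowercased text.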


## Data Augmentation

In [13]:
def get_synonyms(word):
    """
    Retrieve synonyms for a given word using the NLTK WordNet corpus.
    
    Parameters:
        word (str): The input word for which synonyms are to be retrieved.
    
    Returns:
        list: A list of synonyms for the given word.
    
    Note:
        Requires NLTK library and WordNet corpus.
    """
    from nltk.corpus import wordnet
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.add(lemma.name())
    return list(synonyms)

def replace_with_synonym(sentence):
    """
    Replace words in a sentence with their synonyms.
    
    Parameters:
        sentence (str): The input sentence in which words are to be replaced with synonyms.
    
    Returns:
        str: The sentence with words replaced by their synonyms.
    
    Note:
        Not all words in the sentence may have synonyms or may be replaced.
    """
    words = sentence.split()
    for i, word in enumerate(words):
        synonyms = get_synonyms(word)
        if synonyms:
            words[i] = random.choice(synonyms).replace("_", " ")
    return ' '.join(words)

def augment(df):
    """
    Augment the dataframe by appending synonym-replaced copies of the titles labeled "clickbait".
    
    Parameters:
        df (pd.DataFrame): The input dataframe with a 'title' column and a 'label' column.
    
    Returns:
        pd.DataFrame: The augmented dataframe with original and synonym-replaced titles.
    
    Note:
        Only titles with label "clickbait" are augmented.
    """
    # Build synonym-replaced copies of the clickbait rows and append them to the original rows.
    clickbait_rows = df[df["label"] == "clickbait"].copy()
    clickbait_rows["title"] = clickbait_rows["title"].apply(replace_with_synonym)
    return pd.concat([df, clickbait_rows], ignore_index=True)
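
The replacement step can be tried without downloading WordNet. The sketch below swaps the WordNet lookup for a small hard-coded table; `TOY_SYNONYMS`, `replace_with_toy_synonym`, and the sample sentence are assumptions made for illustration only.

```python
import random

# Hypothetical synonym table standing in for WordNet lookups
TOY_SYNONYMS = {
    "shocking": ["stunning", "startling"],
    "secret": ["hidden", "private"],
}

def replace_with_toy_synonym(sentence, rng):
    """Replace each word that has an entry in TOY_SYNONYMS with a random synonym."""
    words = sentence.split()
    for i, word in enumerate(words):
        options = TOY_SYNONYMS.get(word)
        if options:
            words[i] = rng.choice(options)
    return " ".join(words)

rng = random.Random(0)  # seeded so repeated runs give the same augmentation
original = "the shocking secret nobody tells you"
augmented = replace_with_toy_synonym(original, rng)
print(augmented)
```

As in the notebook's function, words without a synonym pass through unchanged and every word that has one is replaced. Passing a seeded `random.Random` instance is a design choice of this sketch that makes the augmented data reproducible.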


## Preprocessing

In [14]:
def preprocess_text(text):
    """
    Preprocess the input text by performing the following steps:
    1. Convert to lowercase.
    2. Remove non-alphabetic characters.
    3. Remove English stopwords.
    4. Lemmatize the words using spaCy.
    
    Parameters:
        text (str): The input text to be preprocessed.
    
    Returns:
        str: The preprocessed text.
    
    Note:
        Requires NLTK library for stopwords and spaCy for lemmatization.
    """
    text = text.lower()
    text = ''.join([char for char in text if char.isalpha() or char.isspace()])
    # Use a set so the membership test below is O(1) per word.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stopwords])
    doc = nlp(text)
    return ' '.join([token.lemma_ for token in doc])
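
The first three steps of `preprocess_text` can be reproduced with the standard library alone. The following is a rough sketch under two assumptions: `TOY_STOPWORDS` is a tiny stand-in for NLTK's English list, and lemmatization is skipped because it needs the spaCy model.

```python
# Hypothetical, tiny stopword list; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"you", "must", "the", "a"}

def basic_clean(text):
    """Lowercase, keep only letters and whitespace, then drop stopwords."""
    text = text.lower()
    text = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    return " ".join(w for w in text.split() if w not in TOY_STOPWORDS)

cleaned = basic_clean("Top 10 BEST tricks you must try!!")
print(cleaned)  # top best tricks try
```

Note that the digits and exclamation marks are gone after this step, so any feature that counts them has to be computed on the raw title.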


## Model Training

In [15]:
def train_models(X_train, Y_train, X_test, Y_test, X, Y):
    """
    Train and evaluate machine learning models using pipelines and grid search.
    
    The function trains two models: Logistic Regression and LinearSVC. It uses a pipeline to preprocess the data 
    with TfidfVectorizer for the 'title' column and StandardScaler for numeric features. Hyperparameters are optimized 
    using GridSearchCV, and the models are evaluated on the test set and using cross-validation.
    
    Parameters:
        X_train (pd.DataFrame): Training data features.
        Y_train (pd.Series): Training data labels.
        X_test (pd.DataFrame): Test data features.
        Y_test (pd.Series): Test data labels.
        X (pd.DataFrame): Complete data features for cross-validation.
        Y (pd.Series): Complete data labels for cross-validation.
    
    Returns:
        dict: A dictionary with model names as keys and their best estimators as values.
    
    Note:
        Requires NLTK for stopwords, scikit-learn for modeling and evaluation, and spaCy for text processing.
    """
    # Create a transformer that applies TfidfVectorizer to the 'title' column and StandardScaler to the numeric features.
    preprocessor = ColumnTransformer(
        transformers=[
            ('text', TfidfVectorizer(stop_words=nltk.corpus.stopwords.words('english')), 'title'),
            ('num', StandardScaler(), [col for col in X.columns if col != 'title'])
        ])

    # Define pipelines for each model
    pipelines = {
        'Logistic Regression': Pipeline([('preprocessor', preprocessor),
                                        ('clf', LogisticRegression(max_iter=1000))]),
        'LinearSVC': Pipeline([('preprocessor', preprocessor),
                                ('clf', LinearSVC(dual=False))])
    }

    # Define parameters for grid search
    param_grid = {
        'preprocessor__text__max_df': [0.85, 0.9, 0.95],
        'preprocessor__text__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__C': [0.1, 1, 10]
    }

    # Optimize hyperparameters and evaluate each model
    best_estimators = {}
    for name, pipeline in pipelines.items():
        grid_search = GridSearchCV(pipeline, param_grid, cv=5)
        grid_search.fit(X_train, Y_train)
        best_estimators[name] = grid_search.best_estimator_

        y_pred = best_estimators[name].predict(X_test)
        accuracy = accuracy_score(Y_test, y_pred)
        print(f"Accuracy of the optimized {name} model: {accuracy:.4f}")

        cv_accuracy = cross_val_score(best_estimators[name], X, Y, cv=5, scoring='accuracy').mean()
        print(f"Average cross-validation accuracy for {name}: {cv_accuracy:.4f}\n")

    return best_estimators
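
The cost of the grid search follows directly from the size of `param_grid`. The sketch below counts the combinations with the standard library only; `cv_folds` mirrors the `cv=5` argument, and nothing here calls scikit-learn.

```python
from itertools import product

param_grid = {
    "preprocessor__text__max_df": [0.85, 0.9, 0.95],
    "preprocessor__text__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [0.1, 1, 10],
}
cv_folds = 5

# Every combination of one value per parameter, as an exhaustive grid search enumerates
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
fits_per_model = len(combinations) * cv_folds

print(len(combinations))  # 27
print(fits_per_model)     # 135
```

With the default `refit=True`, `GridSearchCV` fits the best combination once more on the full training set, so each of the two pipelines is fitted 136 times in total.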


## Model Validation

In [16]:
def validate_model(best_estimators, X_test, Y_test, X_train, Y_train, X, Y):
    """
    Validate the performance of an ensemble model created using the best estimators.
    
    The function creates an ensemble model using the VotingClassifier with the best estimators provided. 
    It then fits the ensemble model on the training data and evaluates its performance on the test set 
    and using cross-validation.
    
    Parameters:
        best_estimators (dict): A dictionary with model names as keys and their best estimators as values.
        X_test (pd.DataFrame): Test data features.
        Y_test (pd.Series): Test data labels.
        X_train (pd.DataFrame): Training data features.
        Y_train (pd.Series): Training data labels.
        X (pd.DataFrame): Complete data features for cross-validation.
        Y (pd.Series): Complete data labels for cross-validation.
    
    Returns:
        VotingClassifier: The trained ensemble model.
    
    Note:
        Requires scikit-learn for modeling and evaluation.
    """
    ensemble_model = VotingClassifier(estimators=[
        ('Logistic Regression', best_estimators['Logistic Regression']),
        ('LinearSVC', best_estimators['LinearSVC'])
    ], voting='hard')

    ensemble_model.fit(X_train, Y_train)
    ensemble_predictions = ensemble_model.predict(X_test)
    ensemble_accuracy = accuracy_score(Y_test, ensemble_predictions)
    print("\nEnsemble model accuracy:", ensemble_accuracy)

    ensemble_cv_accuracy = cross_val_score(ensemble_model, X, Y, cv=5, scoring='accuracy').mean()
    print(f"Average cross-validation accuracy for the ensemble model: {ensemble_cv_accuracy:.4f}\n")

    return ensemble_model
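
Hard voting picks the label predicted by most estimators. The scikit-learn documentation states that ties are resolved by the ascending sort order of the class labels, which matters here because the ensemble has only two voters. The pure-Python sketch below imitates that rule; `hard_vote` is a hypothetical helper and "news" is an assumed second label.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the majority label; ties go to the label that sorts first."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# "news" is a hypothetical second label used only for illustration
print(hard_vote(["clickbait", "clickbait"]))  # clickbait
print(hard_vote(["news", "clickbait"]))       # clickbait (tie)
print(hard_vote(["news", "news"]))            # news
```

With two estimators every disagreement is a tie, so in those cases the ensemble's answer is decided by label order rather than by either model. Adding a third estimator would be one way to avoid that.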


## Saving the Model

In [17]:
import pickle

def save_model(model, filename="model_KikeMau.pickle"):
    """
    Save the trained machine learning model to a file using pickle.
    
    Parameters:
        model (object): The trained machine learning model to be saved.
        filename (str, optional): The name of the file where the model will be saved. 
                                  Defaults to "model_KikeMau.pickle".
    
    Returns:
        None
    
    Note:
        The function will print a message indicating the location where the model was saved.
    """
    # Save the model; the context manager closes the file even if pickling fails
    with open(filename, "wb") as f:
        pickle.dump(model, f)
    print(f"Model saved in {filename}")
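
A quick way to check that a saved object loads back intact is a round trip through a temporary file. In this sketch a plain dict stands in for the trained model, and the file name is hypothetical.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained model; any picklable object works the same way
toy_model = {"name": "toy", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pickle")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)  # True
```

Unpickling executes code embedded in the file, so only load pickle files from sources you trust.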


## End-to-End Pipeline

In [18]:
def main_pipeline():
    """
    Execute the main pipeline that encompasses the entire process of loading the dataset, data augmentation, 
    preprocessing, feature engineering, model training, validation, and saving the trained model.
    
    The pipeline performs the following steps:
    1. Load the dataset.
    2. Check for class imbalance and augment data if necessary.
    3. Apply text preprocessing.
    4. Apply feature engineering.
    5. Split the data into training and test sets.
    6. Train machine learning models and validate their performance.
    7. Train an ensemble model and validate its performance.
    8. Save the trained ensemble model to a file.
    
    Returns:
        None
    
    Note:
        The function will print various messages indicating the progress and results at each step.
    """
    df = load_dataset()

    # Check imbalance and augment data if necessary
    label_counts = df['label'].value_counts()
    majority_count = label_counts.max()
    minority_count = label_counts.min()

    while minority_count / majority_count < 0.7:
        df = augment(df)
        # Recalculate counts after augmentation
        label_counts = df['label'].value_counts()
        majority_count = label_counts.max()
        minority_count = label_counts.min()

    print("Shape of the dataset after Data Augmentation:", df.shape)

    # Apply Preprocessing
    df['title'] = df['title'].apply(preprocess_text)
    print("Shape of the dataset after Preprocessing:", df.shape)

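    # Apply Feature Engineering
    # NOTE: the features are computed on the already-preprocessed titles. preprocess_text
    # drops digits and punctuation, so numbers_count and exclamation_count are always 0 here.
    df = feature_engineering(df)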
    print("Shape of the dataset after Feature Engineering:", df.shape)

    Y = df['label']
    X = df.drop('label', axis=1)

    print("X Dimensions:", X.shape)
    print("Y Dimensions:", Y.shape)

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

    print("X_train Dimensions:", X_train.shape)
    print("Y_train Dimensions:", Y_train.shape)

    best_estimators = train_models(X_train, Y_train, X_test, Y_test, X, Y)
    # Build, fit, and evaluate the ensemble once, keeping the returned model
    ensemble_model = validate_model(best_estimators, X_test, Y_test, X_train, Y_train, X, Y)

    # Save the ensemble model
    save_model(ensemble_model)


# Run the full pipeline
main_pipeline()

Shape of the original dataset: (16823, 2)
Shape of the dataset after Data Augmentation: (27101, 2)
Shape of the dataset after Preprocessing: (27101, 2)
Shape of the dataset after Feature Engineering: (27101, 13)
X Dimensions: (27101, 12)
Y Dimensions: (27101,)
X_train Dimensions: (21680, 12)
Y_train Dimensions: (21680,)
Accuracy of the optimized Logistic Regression model: 0.8585
Average cross-validation accuracy for Logistic Regression: 0.8470

Accuracy of the optimized LinearSVC model: 0.8659
Average cross-validation accuracy for LinearSVC: 0.8516


Ensemble model accuracy: 0.8625714812765173
Average cross-validation accuracy for the ensemble model: 0.8492

Model saved in model_KikeMau.pickle
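
The balancing loop in `main_pipeline` doubles the "clickbait" rows on every pass, because `augment` appends one modified copy of each of them. The simulation below traces that arithmetic. The two class counts are assumptions, chosen so that the totals match the shapes printed above (16823 rows before augmentation, 27101 after); the notebook itself never prints them.

```python
def simulate_balancing(clickbait, other, threshold=0.7):
    """Double the clickbait count until minority / majority reaches the threshold."""
    rounds = 0
    while min(clickbait, other) / max(clickbait, other) < threshold:
        clickbait *= 2  # augment() appends one modified copy of every clickbait row
        rounds += 1
    return rounds, clickbait, other

# Assumed class counts, chosen so the totals match the shapes printed above
rounds, clickbait, other = simulate_balancing(clickbait=3426, other=13397)
print(rounds, clickbait + other)  # 2 27101
```

The loop only terminates when "clickbait" is the minority class. If it were the majority, each doubling would move the ratio further from the threshold and the loop would never exit.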


## Testing the Model (Most Important Part)

In [19]:
def test_model(model_filename="model_KikeMau.pickle", csv_filename="DataSet para entrenamiento del modelo.csv"):
    """
    Load a trained machine learning model from a pickle file and test its performance using data from a CSV file.
    
    Parameters:
        model_filename (str, optional): The name of the pickle file where the model is saved. 
                                         Defaults to "model_KikeMau.pickle".
        csv_filename (str, optional): The name of the CSV file containing the test data. 
                                      Defaults to "DataSet para entrenamiento del modelo.csv".
    
    Returns:
        None
    
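    Note:
        The function will print the accuracy of the loaded model on the test data.
        With the default csv_filename the model is scored on the same file it was trained on,
        so the printed accuracy is optimistic; pass a held-out CSV for a fair estimate.
    """
    # Load the model; the context manager closes the file after reading
    with open(model_filename, "rb") as f:
        loaded_model = pickle.load(f)
    
    # Load the dataset
    df = pd.read_csv(csv_filename)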
    
    # Preprocess the data
    df['title'] = df['title'].apply(preprocess_text)
    df = feature_engineering(df)
    
    Y_test = df['label']
    X_test = df.drop('label', axis=1)
    
    # Predict using the loaded model
    y_pred = loaded_model.predict(X_test)
    accuracy = accuracy_score(Y_test, y_pred)
    print(f"Accuracy of the loaded model: {accuracy:.4f}")

test_model()


Accuracy of the loaded model: 0.9583
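
The accuracy above is measured on the training CSV. In addition, `main_pipeline` augments before calling `train_test_split`, so a synonym-replaced copy of a test title can end up in the training set. One possible way to avoid both effects is to hold out the test rows first and augment only what remains. The sketch below shows that ordering on toy data; `split_then_augment`, the placeholder augmentation, and the "news" label are all assumptions for illustration.

```python
import random

def split_then_augment(rows, test_fraction=0.2, seed=0):
    """Hold out test rows first, then augment only the training rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]
    # Placeholder augmentation: tag a copy of each clickbait training row
    extra = [(title + " (augmented)", label)
             for title, label in train_rows if label == "clickbait"]
    return train_rows + extra, test_rows

# Hypothetical toy rows; "news" is an assumed second label
rows = [(f"title {i}", "clickbait" if i % 3 == 0 else "news") for i in range(10)]
train_rows, test_rows = split_then_augment(rows)
print(len(train_rows), len(test_rows))
```

Because the held-out rows are set aside before any augmentation, no test title or variant of one appears in the training data.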
