# Data Science - Social Media Analytics SoSe 24 📊🔍

## Problemset 4 📝

This notebook is our submission for the weekly tasks in Social Media Analytics for the summer semester of 2024.

### Authors 👥
- **Martin Brucker** (942815) 🧑‍💻
- **Frederik Brinkmann** (943915) 🧑‍💻

**Due**: Friday, 3 May 2024, 11:59 PM

**Contact Information**: martin.brucker@student.fh-kiel.de 📧

### Exercise 1
Vectorize the texts using Bag of Words and choose a baseline machine learning algorithm with default hyperparameters. Vary ONLY your text preprocessing strategy (e.g. lowercasing, stopword removal, stemming, lemmatization, or some combination of these). Keep the rest of your setup fixed.

Evaluate the performance of your models on a holdout set. Provide a table with the evaluation results. The table should also show the number of features used in each case.

Briefly summarize your main findings.

#### Installation of necessary packages

In [10]:
# !pip install nltk
# !pip install scikit-learn
# !pip install pandas
# !pip install langdetect

#### Library imports

In [11]:
# Natural Language Toolkit (NLTK) for text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Vectorizers for converting text data into numerical features
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Scikit-learn methods and models for machine learning tasks
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

# Data handling libraries
import pandas as pd
import numpy as np

# Optional library for language detection of the texts
from langdetect import detect

#### Load dataset

In [12]:
# Load the restaurant reviews dataset from a CSV file
restaurant_reviews_df = pd.read_csv("data/restaurant-reviews.csv")
restaurant_reviews_df.head()

|   | name       | restaurant_url                                    | title                                             | text                                              | rating |
|---|------------|---------------------------------------------------|---------------------------------------------------|---------------------------------------------------|--------|
| 0 | Manufactur | https://www.tripadvisor.com/Restaurant_Review-... | Best in Kiel                                      | The absolutely best restaurant in the town of ... | 5.0    |
| 1 | Manufactur | https://www.tripadvisor.com/Restaurant_Review-... | Simply, tasty and very good                       | Tasty and high quality food! A “healthier”way ... | 5.0    |
| 2 | Manufactur | https://www.tripadvisor.com/Restaurant_Review-... | Delicious fast food!                              | The food was more than we asked for and we whe... | 5.0    |
| 3 | Manufactur | https://www.tripadvisor.com/Restaurant_Review-... | Manufactur                                        | They have some amazing service amzing food and... | 5.0    |
| 4 | Manufactur | https://www.tripadvisor.com/Restaurant_Review-... | clear but appealing menu: you will find what y... | Manufaktur is really a nice small self service... | 5.0    |


#### Class imbalance and language analysis

##### Class Imbalance

In [13]:
# Count the occurrences of each unique rating in the dataset
restaurant_reviews_df.rating.value_counts()

rating
5.0    469
4.0    333
3.0    104
2.0     54
1.0     40
Name: count, dtype: int64

The data show a strong class imbalance: 802 of the 1,000 reviews (about 80%) have a rating of 4 or 5 stars, while only 94 have 1 or 2 stars. This should be addressed in further experiments, which are beyond the scope of these exercises.
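
To put the accuracy figures reported later into context, the rating counts can be turned into class shares and a majority-class baseline, i.e. the accuracy of a trivial model that always predicts the most frequent rating. The following is a minimal sketch in plain Python; the counts are copied from the output above, and the variable names are illustrative and not part of the notebook.

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5.0: 469, 4.0: 333, 3.0: 104, 2.0: 54, 1.0: 40}

total = sum(rating_counts.values())

# Share of each rating in the full dataset
class_shares = {rating: count / total for rating, count in rating_counts.items()}

# Accuracy of a trivial model that always predicts the most frequent rating
majority_rating = max(rating_counts, key=rating_counts.get)
majority_baseline = rating_counts[majority_rating] / total

for rating, share in class_shares.items():
    print(f"{rating}: {share:.1%}")
print(f"Majority-class baseline ({majority_rating}): {majority_baseline:.3f}")
```

Since the train/test split used later is not stratified, the share of 5-star reviews in the holdout set may deviate somewhat from this value.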



##### Detecting non-English texts

In [14]:
# Detect the language of each review text
restaurant_reviews_df["lang"] = restaurant_reviews_df["text"].apply(detect)

# Count the occurrences of each detected language in the dataset
restaurant_reviews_df.lang.value_counts()

lang
en    963
de     36
ca      1
Name: count, dtype: int64

Although the review texts should be in English, `langdetect` classifies 36 of them as German and 1 as Catalan. This can optionally be addressed by keeping only the texts detected as English.

In [15]:
# Optional Data Preprocessing: Remove non-English reviews from the dataset
# restaurant_reviews_df = restaurant_reviews_df[restaurant_reviews_df.lang == "en"]

In [16]:
# Download necessary NLTK resources
nltk.download("stopwords")
nltk.download("wordnet")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mbrucker/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/mbrucker/nltk_data...


True

In [17]:
# Initialize stemmer, lemmatizer, and stop words list for English text preprocessing
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

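Note: The Porter stemmer and the WordNet lemmatizer are only one possible choice. NLTK alone also offers, for example, `SnowballStemmer` and `LancasterStemmer`, and the results below may differ with other tools. In addition, `WordNetLemmatizer.lemmatize` is called here without a part-of-speech tag, so every word is treated as a noun.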

##### Function Definitions

In [18]:
def preprocess_text(text, lower=False, remove_stopwords=False, stemming=False, lemmatize=False):
    """
    Preprocesses the input text with specified preprocessing steps.

    Parameters:
    - text: Input text to be preprocessed
    - lower: If True, convert text to lowercase (default: False)
    - remove_stopwords: If True, remove stopwords from text (default: False)
    - stemming: If True, apply stemming to words in text (default: False)
    - lemmatize: If True, apply lemmatization to words in text (default: False)

    Returns:
    - Preprocessed text
    """
    if lower:
        text = text.lower()

    words = text.split()

    if remove_stopwords:
        words = [word for word in words if word not in stop_words]

    if stemming:
        words = [stemmer.stem(word) for word in words]

    if lemmatize:
        words = [lemmatizer.lemmatize(word) for word in words]

    return " ".join(words)
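
One property of `preprocess_text` is worth illustrating: the stop word test is case-sensitive and runs on whitespace-separated tokens, while NLTK's English stop word list is all lowercase. Without lowercasing, capitalized stop words such as "The" therefore survive. The sketch below mimics only these two steps with a tiny, made-up stop word set; `demo_stop_words` and `remove_stopwords_demo` are hypothetical names used for illustration.

```python
# Tiny illustrative stop word set (stands in for NLTK's lowercase English list)
demo_stop_words = {"the", "was", "and"}


def remove_stopwords_demo(text, lower=False):
    """Mimics the lowercasing and stop word steps of preprocess_text."""
    if lower:
        text = text.lower()
    # Whitespace split, followed by a case-sensitive membership test
    return [word for word in text.split() if word not in demo_stop_words]


sample = "The food was great and the service was friendly"

# Without lowercasing, the capitalized "The" is not recognized as a stop word
print(remove_stopwords_demo(sample, lower=False))

# With lowercasing, all stop words are removed
print(remove_stopwords_demo(sample, lower=True))
```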

In [19]:
def vectorize_data(vectorizer, data, preprocess_config):
    """
    Vectorizes text data using a specified vectorizer.

    Parameters:
    - vectorizer: Vectorizer object (e.g., CountVectorizer, TfidfVectorizer)
    - data: DataFrame containing text data
    - preprocess_config: Dictionary specifying preprocessing steps

    Returns:
    - Features
    """
    # Preprocess text data
    preprocessed_texts = data["text"].apply(preprocess_text, **preprocess_config)

    # Vectorize preprocessed text using the specified vectorizer
    features = vectorizer.fit_transform(preprocessed_texts)

    return features
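
Note that `vectorize_data` calls `fit_transform` on all reviews, so the vocabulary is learned before the data are split and also contains words that occur only in the holdout set. A stricter setup would learn the vocabulary from the training texts alone and only apply it to the test texts. The following dependency-free sketch shows the idea with a hand-written bag of words; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions. In scikit-learn, the analogous pattern would be `fit_transform` on the training texts followed by `transform` on the test texts.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map every lowercase token seen in `texts` to a column index."""
    tokens = sorted({token for text in texts for token in text.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def transform(texts, vocabulary):
    """Turn texts into count vectors; tokens outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        counts = Counter(text.lower().split())
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


train_texts = ["great food great service", "bad food"]
test_texts = ["great pizza"]

vocabulary = fit_vocabulary(train_texts)    # learned from the training texts only
X_train = transform(train_texts, vocabulary)
X_test = transform(test_texts, vocabulary)  # "pizza" is unseen and therefore dropped

print(vocabulary)
print(X_train)
print(X_test)
```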

##### Preprocessing Configurations

In [20]:
# All 16 combinations of the four preprocessing switches
# (same order as listing them by hand, from all False to all True)
from itertools import product

preprocessing_keys = ["lower", "remove_stopwords", "stemming", "lemmatize"]
preprocessing_configs = [
    dict(zip(preprocessing_keys, flags))
    for flags in product([False, True], repeat=len(preprocessing_keys))
]

#### Constants

In [21]:
# Seed for random number generation to ensure reproducibility
RANDOM_STATE = 42

# Proportion of data allocated for testing
TEST_SIZE = 0.25

#### Analysis of different combinations of preprocessing techniques

Multinomial Naive Bayes was chosen as the baseline model because it is simple, fast to train, and widely used as a baseline in real-world natural language processing.

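For the F1 score, the weighted average is used: the F1 score of each class is weighted by its number of true instances. With the highly unbalanced class distribution of this dataset, the frequent ratings therefore dominate the score, whereas a macro average would weight all five ratings equally.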
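
To make the weighted average concrete, it can be computed by hand: one F1 score per class, averaged with the number of true instances per class as weights. The sketch below is a plain-Python illustration with made-up labels; `weighted_f1` is a hypothetical helper intended to mirror `f1_score(..., average="weighted")`, not a replacement for it.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 scores averaged with the class supports as weights."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Made-up example: the rare 1-star review is never predicted correctly
y_true = [5, 5, 5, 4, 4, 1]
y_pred = [5, 5, 4, 4, 5, 5]

print(round(weighted_f1(y_true, y_pred), 3))
```

In this example the 1-star class contributes an F1 score of 0 but carries only one sixth of the weight, which shows how rare classes have little influence on the weighted score.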

In [22]:
# Initialize CountVectorizer (Bag of Words) for converting text data into token counts
vectorizer = CountVectorizer()

In [23]:
def analyze_preprocessing_techniques(df, configs):
    """
    Analyzes the effect of different preprocessing techniques on classification performance.

    Parameters:
    - df: DataFrame containing text data and labels
    - configs: List of dictionaries specifying different preprocessing configurations

    Returns:
    - DataFrame containing results sorted by accuracy and F1 score
    """
    labels = df["rating"]
    results = []

    for config in configs:
        features = vectorize_data(vectorizer, df, config)

        X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=TEST_SIZE, shuffle=True, random_state=RANDOM_STATE)

        model = MultinomialNB()
        model.fit(X_train, y_train)

        predictions = model.predict(X_test)

        acc = accuracy_score(y_test, predictions)
        f1 = f1_score(y_test, predictions, average="weighted")

        results.append({
            "Lower": config["lower"],
            "Remove Stopwords": config["remove_stopwords"],
            "Stemming": config["stemming"],
            "Lemmatize": config["lemmatize"],
            "Accuracy": acc,
            "F1 Score": f1,
            "Feature Count": X_train.shape[1]
        })

    results_df = pd.DataFrame(results)

    sorted_results_df = results_df.sort_values(by=["Accuracy", "F1 Score"], ascending=False)

    return sorted_results_df

In [24]:
# Analyze the effect of different preprocessing techniques on model performance
preprocessing_techniques_df = analyze_preprocessing_techniques(restaurant_reviews_df, preprocessing_configs)
preprocessing_techniques_df


|   | Lower | Remove Stopwords | Stemming | Lemmatize | Accuracy | F1 Score | Feature Count |
|---|-------|------------------|----------|-----------|----------|----------|---------------|
| 4 | False | True             | False    | False     | 0.592    | 0.556    | 6045          |
| 9 | True  | False            | False    | True      | 0.592    | 0.551    | 5888          |
| 5 | False | True             | False    | True      | 0.588    | 0.551    | 5896          |
| 12| True  | True             | False    | False     | 0.584    | 0.552    | 6028          |
| 1 | False | False            | False    | True      | 0.584    | 0.544    | 5914          |
| 13| True  | True             | False    | True      | 0.576    | 0.542    | 5854          |
| 6 | False | True             | True     | False     | 0.572    | 0.531    | 5714          |
| 7 | False | True             | True     | True      | 0.568    | 0.528    | 5710          |
| 0 | False | False            | False    | False     | 0.564    | 0.524    | 6060          |
| 8 | True  | False            | False    | False     | 0.564    | 0.524    | 6060          |
| 15| True  | True             | True     | True      | 0.560    | 0.525    | 5691          |
| 14| True  | True             | True     | False     | 0.560    | 0.524    | 5694          |
| 3 | False | False            | True     | True      | 0.552    | 0.509    | 5728          |
| 11| True  | False            | True     | True      | 0.552    | 0.509    | 5728          |
| 2 | False | False            | True     | False     | 0.544    | 0.501    | 5732          |
| 10| True  | False            | True     | False     | 0.544    | 0.501    | 5732          |


The results show that stemming (at least with the Porter stemmer on this particular dataset) is not a good choice of preprocessing technique. The six best runs by accuracy and F1 score use no stemming, while the six worst runs all use it.

Applying no preprocessing at all (run 0, accuracy 0.564) and applying all four techniques (run 15, accuracy 0.560) both end up in the lower middle of the ranking.

The five best runs apply at most two preprocessing techniques. However, some runs with three techniques come close, for example run 13 (lowercasing, stop word removal, and lemmatization) with an accuracy of 0.576.

Lowercasing on its own has no effect: runs 0 and 8 are identical, because `CountVectorizer` lowercases its input by default. The same holds for the runs with stemming but without stop word removal (runs 2 and 10, runs 3 and 11), since NLTK's `PorterStemmer` also lowercases by default. The `lower` switch therefore only matters indirectly, through the case-sensitive stop word test and through the lemmatizer, as the differing feature counts of runs 1 and 9 indicate.

The feature counts differ only moderately: 369 features lie between the run with the fewest features (5,691) and the run with the most (6,060).
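
How much weight the accuracy gaps in the table can bear can be estimated from the size of the holdout set. The sketch below assumes that the dataset contains 1,000 reviews (the sum of the rating counts) and that 25% of them end up in the holdout set. It converts accuracies back into numbers of correctly classified reviews and computes the binomial standard error of an accuracy estimate. The helper names are illustrative.

```python
import math

N_REVIEWS = 1000   # assumption: sum of the rating counts shown earlier
TEST_SIZE = 0.25   # same value as the notebook constant
n_test = math.ceil(N_REVIEWS * TEST_SIZE)


def correct_predictions(accuracy, n=n_test):
    """Number of correctly classified holdout reviews behind an accuracy value."""
    return round(accuracy * n)


def standard_error(accuracy, n=n_test):
    """Binomial standard error of an accuracy estimated on n samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


best, worst = 0.592, 0.544  # best and worst accuracy from the results table

# Gap between the best and the worst run, expressed in reviews
review_gap = correct_predictions(best) - correct_predictions(worst)
print(n_test, review_gap)

# Rough uncertainty of a single accuracy estimate
print(round(standard_error(best), 3))
```

Under these assumptions, one review corresponds to 0.004 accuracy, and the standard error of a single estimate is several times larger than the gaps between the top-ranked runs. The ranking among those runs should therefore be read with caution.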

In [25]:
# Extract the best preprocessing configuration from the analysis results DataFrame
best_config = preprocessing_techniques_df.rename(columns={
    "Lower": "lower",
    "Remove Stopwords": "remove_stopwords",
    "Stemming": "stemming",
    "Lemmatize": "lemmatize"
})[["lower", "remove_stopwords", "stemming", "lemmatize"]].iloc[0].to_dict()

# Display the best configuration as a dictionary
best_config

{'lower': False,
 'remove_stopwords': True,
 'stemming': False,
 'lemmatize': False}

The best combination of preprocessing techniques for this particular dataset with the chosen stemmer and lemmatizer (ranked by accuracy, with ties broken by F1 score) is:
- No lowercase
- Remove stop words
- No stemming
- No lemmatization

### Exercise 2

Pick the best preprocessing strategy from Exercise 1 and vary ONLY your feature engineering strategy: bag of words, TF-IDF, bag of 2-grams. Keep the rest of your setup fixed.

Evaluate the performance of your models on a holdout set. Provide a table with the evaluation results.

Briefly summarize your main findings.

#### Analysis of different feature engineering methods

In [26]:
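# Dictionary of different vectorizer configurations
# Note: ngram_range=(1, 2) keeps the unigrams and adds bigrams;
# a vectorizer restricted to bigrams would use ngram_range=(2, 2).
vectorizer_configs = {
    "Bag of Words": CountVectorizer(),
    "TF-IDF": TfidfVectorizer(),
    "Bag of 2-grams": CountVectorizer(ngram_range=(1, 2))
}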
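
The `ngram_range=(1, 2)` setting used for the "Bag of 2-grams" entry produces unigrams as well as bigrams. The difference to a pure bigram setting can be seen in a small plain-Python sketch; `ngrams` is a hypothetical helper that only imitates the idea behind the `ngram_range` parameter on an already tokenized text.

```python
def ngrams(tokens, n_min, n_max):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


tokens = "very good pizza".split()

# Unigrams and bigrams, analogous to ngram_range=(1, 2)
print(ngrams(tokens, 1, 2))

# Bigrams only, analogous to ngram_range=(2, 2)
print(ngrams(tokens, 2, 2))
```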

In [27]:
def analyze_vectorizers(df, configs):
    """
    Analyzes the performance of different vectorization techniques on classification.

    Parameters:
    - df: DataFrame containing text data and labels
    - configs: Dictionary of vectorizer configurations

    Returns:
    - DataFrame containing results sorted by accuracy and F1 score
    """
    labels = df["rating"]
    results = []

    for name, vectorizer in configs.items():
        features = vectorize_data(vectorizer, df, best_config)

        X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=TEST_SIZE, shuffle=True, random_state=RANDOM_STATE)

        model = MultinomialNB()
        model.fit(X_train, y_train)

        predictions = model.predict(X_test)

        acc = accuracy_score(y_test, predictions)
        f1 = f1_score(y_test, predictions, average="weighted")

        results.append({
            "Vectorizer": name,
            "Accuracy": acc,
            "F1 Score": f1,
            "Feature Count": features.shape[1]
        })

    results_df = pd.DataFrame(results)

    sorted_results_df = results_df.sort_values(by=["Accuracy", "F1 Score"], ascending=False)

    return sorted_results_df

In [28]:
feature_engineering_df = analyze_vectorizers(restaurant_reviews_df, vectorizer_configs)
feature_engineering_df.head()


|   | Vectorizer      | Accuracy | F1 Score | Feature Count |
|---|-----------------|----------|----------|---------------|
| 0 | Bag of Words    | 0.592    | 0.556    | 6045          |
| 2 | Bag of 2-grams  | 0.588    | 0.559    | 35218         |
| 1 | TF-IDF          | 0.480    | 0.334    | 6045          |

Changing the feature engineering technique (the vectorizer) has a clear impact on the accuracy and F1 score of the runs. Bag of Words reproduces the result from Exercise 1, since nothing has changed. Bag of 2-grams, which here means unigrams plus bigrams because of `ngram_range=(1, 2)`, is slightly worse in accuracy (0.588 vs. 0.592) but slightly better in F1 score (0.559 vs. 0.556). This small gain comes at the cost of almost six times as many features (35,218 vs. 6,045).

TF-IDF performs far worse than the other two techniques for the given preprocessing configuration, with an accuracy of 0.480 and an F1 score of 0.334. Its accuracy is close to the share of 5-star reviews in the dataset (46.9%), which suggests that the model predicts the majority class for almost every review.
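
One plausible factor behind the weak TF-IDF result is the scale of the features. The sketch below re-implements TF-IDF in plain Python using the weighting that, to our understanding, matches the `TfidfVectorizer` defaults (smoothed IDF, L2-normalized rows). The texts are made up and `tfidf_rows` is a hypothetical helper. It shows that a TF-IDF row carries far less total weight than the corresponding raw count row, so the fixed smoothing of Multinomial Naive Bayes has relatively more influence. Whether this fully explains the result was not tested here.

```python
import math
from collections import Counter


def tfidf_rows(texts):
    """TF-IDF with smoothed IDF and L2 row normalization."""
    docs = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {
        token: math.log((1 + n_docs) / (1 + sum(token in doc for doc in docs))) + 1
        for token in vocabulary
    }
    rows = []
    for doc in docs:
        weights = [doc[token] * idf[token] for token in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocabulary, rows


texts = ["great food great service", "bad food", "great pizza"]
vocabulary, rows = tfidf_rows(texts)

count_total = len(texts[0].split())  # total weight of the raw count row
tfidf_total = sum(rows[0])           # total weight of the TF-IDF row

print(count_total, round(tfidf_total, 3))
```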

In [29]:
# Retrieve the best vectorizer configuration from the feature engineering results DataFrame
best_vectorizer_name = feature_engineering_df.iloc[0]["Vectorizer"]
best_vectorizer = vectorizer_configs[best_vectorizer_name]

# Display the best vectorizer
best_vectorizer

### Exercise 3

Choose one of the aspects to further improve the performance of the model (e.g. using different machine learning algorithms, tuning hyperparameters of a given algorithm, or also different strategies for handling class imbalance) and evaluate its effectiveness.

Briefly summarize your main findings.

#### Analysis of Random Search Cross-Validation as a method to further improve the model's performance

There are several ways one could try to further improve the performance of the model. Besides data cleaning, one option is to tune the hyperparameters of the model with cross-validation. A grid search cross-validation evaluates every combination of hand-picked hyperparameter values. A random search cross-validation instead draws a fixed number of combinations at random from given lists or distributions and evaluates only those.
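
The cross-validation part can be sketched independently of the search strategy: the data are divided into k folds, and each fold serves once as the validation set while the remaining folds are used for training. The following plain-Python sketch uses simple contiguous folds; `k_fold_indices` is a hypothetical helper, and library implementations may additionally shuffle or stratify the folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))

for train, validation in folds:
    print(validation, len(train))
```

The score reported for one hyperparameter combination is then the mean of the k validation scores.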

In [30]:
def perform_random_search_cv(df, vectorizer, hyperparameter_candidates):
    """
    Performs random search cross-validation to find the best hyperparameters for Multinomial Naive Bayes model.

    Parameters:
    - df: DataFrame containing text data and labels
    - vectorizer: Vectorizer object
    - hyperparameter_candidates: Dictionary containing hyperparameter distributions

    Returns:
    - Tuple containing the best cross-validation score and the corresponding best hyperparameters
    """
    model = MultinomialNB()

    random_search = RandomizedSearchCV(model, hyperparameter_candidates, n_iter=100, cv=5, n_jobs=-1, random_state=RANDOM_STATE)

    features = vectorize_data(vectorizer, df, best_config)
    labels = df["rating"]

    random_search.fit(features, labels)

    return random_search.best_score_, random_search.best_params_

In [31]:
# Dictionary of candidate values per hyperparameter for random search cross-validation
hyperparameter_candidates = {
    "alpha": np.logspace(-4, 1, 20),
    "fit_prior": [True, False],
    "class_prior": [None, [0.2, 0.2, 0.2, 0.2, 0.2]]
}
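
Because all three hyperparameters are given as lists, the search space is finite and its size can be counted. The sketch below rebuilds the candidate lists without NumPy (the `alphas` expression is intended to reproduce `np.logspace(-4, 1, 20)`), enumerates the full grid, and draws a random subset from it. The subset size of 10 is an arbitrary value chosen for illustration.

```python
import itertools
import random

# Same candidate values as np.logspace(-4, 1, 20), written without NumPy
alphas = [10 ** (-4 + 5 * i / 19) for i in range(20)]
fit_priors = [True, False]
class_priors = [None, [0.2] * 5]

# Grid search: every combination would be evaluated
grid = list(itertools.product(alphas, fit_priors, class_priors))

# Random search: only n_iter combinations are drawn (seeded for repeatability)
rng = random.Random(42)
n_iter = 10  # illustrative value, smaller than the grid
sampled = rng.sample(grid, n_iter)

print(len(grid), len(sampled))
```

The full grid contains 80 combinations, which is fewer than the `n_iter=100` passed to `RandomizedSearchCV` in this notebook. As far as we can tell, scikit-learn samples without replacement in this situation and cannot evaluate more candidates than the space contains, so the search here effectively covers the whole grid.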

In [32]:
# Perform random search cross-validation to find the best hyperparameters for the Multinomial Naive Bayes model
best_score, best_params = perform_random_search_cv(restaurant_reviews_df, best_vectorizer, hyperparameter_candidates)



In [33]:
best_score

0.549

In [34]:
best_params

{'fit_prior': False, 'class_prior': None, 'alpha': 1.623776739188721}
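
The role of the selected `alpha` can be illustrated with the additive smoothing formula of Multinomial Naive Bayes, in which `alpha` is added to every word count of a class before normalizing. In the sketch below, the vocabulary size is the Bag of Words feature count reported above, while the class token total and the word counts are made-up numbers. `smoothed_likelihood` is a hypothetical helper.

```python
def smoothed_likelihood(word_count, class_total, vocabulary_size, alpha):
    """P(word | class) with additive smoothing: (count + alpha) / (total + alpha * V)."""
    return (word_count + alpha) / (class_total + alpha * vocabulary_size)


VOCABULARY_SIZE = 6045  # Bag of Words feature count from the results
CLASS_TOTAL = 2000      # hypothetical number of tokens in one rating class

for alpha in (1e-4, 1.0, 1.62):
    unseen = smoothed_likelihood(0, CLASS_TOTAL, VOCABULARY_SIZE, alpha)
    seen = smoothed_likelihood(20, CLASS_TOTAL, VOCABULARY_SIZE, alpha)
    # The ratio shows how strongly a word seen 20 times outweighs an unseen word
    print(f"alpha={alpha}: unseen={unseen:.2e}, seen/unseen={seen / unseen:.1f}")
```

A larger `alpha` pulls the likelihoods of seen and unseen words closer together, which acts as stronger regularization. This can be helpful for rare classes with few training reviews.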

In [35]:
results = []

features = vectorize_data(best_vectorizer, restaurant_reviews_df, best_config)
labels = restaurant_reviews_df["rating"]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=TEST_SIZE, shuffle=True, random_state=RANDOM_STATE)

# Initialize and train the Multinomial Naive Bayes model with the best hyperparameters
model = MultinomialNB(alpha=best_params["alpha"], fit_prior=best_params["fit_prior"], class_prior=best_params["class_prior"])
model.fit(X_train, y_train)

predictions = model.predict(X_test)

acc = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average="weighted")

results.append({
    "Method": "Random Search Cross-Validation",
    "Accuracy": acc,
    "F1 Score": f1,
    "Feature Count": features.shape[1]
})

results_df = pd.DataFrame(results)

sorted_results_df = results_df.sort_values(by=["Accuracy", "F1 Score"], ascending=False)

sorted_results_df


|   | Method                          | Accuracy | F1 Score | Feature Count |
|---|--------------------------------|----------|----------|---------------|
| 0 | Random Search Cross-Validation | 0.600    | 0.557    | 6045          |

With the hyperparameters found by random search cross-validation, the accuracy improves slightly from 0.592 to 0.600 and the F1 score from 0.556 to 0.557 compared to the Bag of Words baseline. Note that the search was run on the entire dataset, including the holdout set, so even this small gain is likely optimistic. Other, more sophisticated approaches should therefore be taken to obtain a well-performing model. Using the final configuration, a production model would then be trained on the entire dataset to include all available data in the training.