# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: 

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

# Imports for text processing tasks
!pip install contractions > /dev/null
import contractions
import nltk # Imports the Natural Language Toolkit module
import re # regex

# Imports for pipelines, modeling, and evaluation
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)
from sklearn.datasets import fetch_20newsgroups

# I selected 3 of the 20 categories because every run on more categories was taking a very long time
# Note that with 3 categories this is still a multiclass classification problem
selected_categories = [
    'sci.electronics',   # Topics related to electronics
    'sci.med',           # Topics related to medical sciences
    'sci.space',         # Topics related to space science 
]

# Fetch only data for the selected topics
newsgroups = fetch_20newsgroups(subset='all', categories=selected_categories)
posts, targets = [s.strip() for s in newsgroups.data], newsgroups.target

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

1. My dataset is the 20 Newsgroups dataset, loaded through `sklearn.datasets.fetch_20newsgroups`.
1. I picked this dataset because I wanted to practice implementing machine learning models for text-based data, since I will soon be coding my own Amazon review sentiment analyser with Spark for my final project in the ENSF 612 big data course. However, I also wanted to go one step further and build a multiclass classifier instead of a binary one.
1. Yes. I looked at many text-based datasets, but most of them were for sentiment analysis using Twitter data, Amazon data or wine data. I thought classifying news was cool. I also struggled because I originally downloaded 20 Newsgroups from the web as a .tar file with 20 folders, each containing hundreds of individual document files, one per post. Later on I realized I could simply import it from sklearn. If it weren't for sklearn, I would have given up on this dataset.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [3]:
# Download necessary NLTK resources to process Text Data                   
_ = nltk.download('punkt', quiet=True)           # Downloads the pre-trained tokenizer models used to split text into sentences
_ = nltk.download('stopwords', quiet=True)       # Downloads a set of stopwords
_ = nltk.download('wordnet', quiet=True)         # Downloads a lexical database of English, used for lemmatization

In [4]:
# Clean data (if needed)

# Custom transformer for word expansion
class WordExpander(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [contractions.fix(text) for text in X]

# Custom transformer for text cleaning
class TextCleaning(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [re.sub(r'[^\w\s]', '', text.replace("'s", "")) for text in X]

# Custom transformer for tokenization
class Tokenizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [nltk.word_tokenize(text) for text in X]

# Custom transformer for stop word removal
class StopwordRemover(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        stop_words = set(nltk.corpus.stopwords.words('english'))
        return [[word for word in tokens if word.lower() not in stop_words] for tokens in X]

# Custom transformer for lemmatization
class Lemmatizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        lemmatizer = nltk.stem.WordNetLemmatizer()
        return [' '.join([lemmatizer.lemmatize(word) for word in tokens]) for tokens in X]

In [5]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed

Note: I did not use a ColumnTransformer because my dataset contains a single column with a single type of data (text). All of the preprocessing transformations are therefore applied to that one column.
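
As a rough illustration of how stateless transformers chain on a single text column, here is a minimal sketch. `RegexCleaner` and `Lowercaser` are hypothetical stand-ins, not the notebook's classes; they avoid the NLTK downloads, and the sample sentence is made up. `RegexCleaner` reuses the same regular expression as `TextCleaning`.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class RegexCleaner(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for TextCleaning: drops possessive 's and punctuation."""

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return [re.sub(r"[^\w\s]", "", text.replace("'s", "")) for text in X]


class Lowercaser(TransformerMixin, BaseEstimator):
    """Hypothetical extra step, included only to show chaining."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


# Each step receives the output of the previous one, so no ColumnTransformer is needed
text_prep = Pipeline([
    ("cleaner", RegexCleaner()),
    ("lowercaser", Lowercaser()),
])

sample = ["NASA's rocket, launched!"]
cleaned = text_prep.fit_transform(sample)
print(cleaned)  # ['nasa rocket launched']
```

Because every step takes a list of strings and returns a list of strings (or tokens), the steps can be reordered or removed without touching the others.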

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

1. This dataset is simply a list of posts (text data), so it does not contain "missing values" in the traditional sense: a post either exists or it does not. However, if the dataset were a table or DataFrame containing other features in addition to the text, I would need to fill in the missing values or remove the rows or columns missing a significant amount of data.
2. I have text-based data. Each sample is a post, and the target is a label for the topic the post belongs to. As preprocessing steps I have to apply text cleaning, tokenization, stop word removal, and stemming or lemmatization, and then convert the text into numerical data the computer can understand using either the bag-of-words strategy or TF-IDF.

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [6]:
# Imports for supervised learning classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(posts, targets, random_state=0)

In [7]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

# Create a general pipeline with a placeholder for the classifier
pipeline = Pipeline([
    ('word_expander', WordExpander()),
    ('text_cleaning', TextCleaning()),
    ('tokenizer', Tokenizer()),
    ('stopword_remover', StopwordRemover()),
    ('lemmatizer', Lemmatizer()),
    ('vectorizer', TfidfVectorizer(max_features=5000)),
    ('classifier', None)  # Placeholder for the classifier
])

# Define a combined parameter grid
param_grid = [
    {'classifier': [LogisticRegression(max_iter=5000)],
     'classifier__C': [0.1, 1, 10],
     'classifier__penalty': ['l1', 'l2'],
     'classifier__solver': ['liblinear', 'saga']},
    {'classifier': [MultinomialNB()],
     'classifier__alpha': [0.1, 1, 10]},
    {'classifier': [KNeighborsClassifier()],
     'classifier__n_neighbors': [3, 5, 7],
     'classifier__weights': ['uniform', 'distance']}
]

# Define scoring metrics for all classifiers
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1_score': make_scorer(f1_score, average='weighted')
}

# Create GridSearchCV instance
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring=scoring, refit='f1_score')

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Print the best model parameters 
print("Best Overall Estimator:", grid_search.best_estimator_)
print("\nBest Overall Parameters:", grid_search.best_params_)

# Print the best model results
cv_results_df = pd.DataFrame(grid_search.cv_results_)

best_f1_score = cv_results_df.loc[grid_search.best_index_, 'mean_test_f1_score']
print(f'\nBest average Cross-Validation F1 Score: {best_f1_score:.2f}')

best_accuracy = cv_results_df.loc[grid_search.best_index_, 'mean_test_accuracy']
print(f'\nBest average Cross-Validation Accuracy: {best_accuracy:.2f}')

Best Overall Estimator: Pipeline(steps=[('word_expander', WordExpander()),
                ('text_cleaning', TextCleaning()), ('tokenizer', Tokenizer()),
                ('stopword_remover', StopwordRemover()),
                ('lemmatizer', Lemmatizer()),
                ('vectorizer', TfidfVectorizer(max_features=5000)),
                ('classifier',
                 LogisticRegression(C=10, max_iter=5000, solver='liblinear'))])

Best Overall Parameters: {'classifier': LogisticRegression(C=10, max_iter=5000, solver='liblinear'), 'classifier__C': 10, 'classifier__penalty': 'l2', 'classifier__solver': 'liblinear'}

Best average Cross-Validation F1 Score: 0.98

Best average Cross-Validation Accuracy: 0.98


### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

1. I need classification models, since I am trying to assign each news post a discrete label corresponding to electronics, medical science or space science.
1. I selected Logistic Regression because I had to choose one linear model, and because it is easy to implement and a good starting point, especially for large datasets. I selected Multinomial Naive Bayes because several YouTube videos mention NB as a standard, commonly used model for NLP: it treats the features as independent, which makes it efficient for text data, where the number of features is on the order of thousands. Finally, I implemented the K-Neighbors Classifier because we had not implemented it in the previous assignments, and since KNN classifies a sample by comparing its position in feature space with that of its nearest neighbours, I thought it would suit this classification problem.
1. Logistic Regression worked the best. I think this makes sense: a linear model performs well when the classes are close to linearly separable in the TF-IDF feature space, which I think is what happened here, and it would also explain the high F1 scores.
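
One way to back up a statement like "model X worked the best" is to summarize `cv_results_` per classifier instead of looking only at the single best candidate. The sketch below does this on a made-up toy corpus, so the scores it prints say nothing about the 20 Newsgroups results; all `toy_*` names and the `model` column are introduced here for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 6 short posts for each of 2 topics
space = ["rocket orbit launch", "moon orbit mission", "satellite launch orbit",
         "rocket engine launch", "mars mission rocket", "orbit satellite moon"]
med = ["doctor patient medicine", "patient disease treatment", "medicine dose doctor",
       "disease symptom patient", "treatment doctor dose", "symptom medicine disease"]
toy_posts = space + med
toy_targets = [0] * len(space) + [1] * len(med)

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),  # replaced by the grid below
])

toy_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)], "classifier__C": [1, 10]},
    {"classifier": [MultinomialNB()], "classifier__alpha": [0.1, 1]},
    {"classifier": [KNeighborsClassifier()], "classifier__n_neighbors": [1, 3]},
]

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2, scoring="f1_weighted")
toy_search.fit(toy_posts, toy_targets)

# One row per candidate; label each row with its classifier's class name
results = pd.DataFrame(toy_search.cv_results_)
results["model"] = results["param_classifier"].map(lambda est: type(est).__name__)

# Best mean cross-validation score reached by each model family
summary = results.groupby("model")["mean_test_score"].max().sort_values(ascending=False)
print(summary)
```

Applied to the notebook's `grid_search`, the same grouping would use the `mean_test_f1_score` column, because that search was run with multiple metrics.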

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [8]:
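# Calculate testing accuracy (1 mark)
# With multi-metric scoring, grid_search.score() reports the refit metric (weighted F1), not plain accuracy
print(f'Test weighted F1 score {grid_search.score(X_test, y_test):.2f}')

Test weighted F1 score 0.98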
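
When `GridSearchCV` is given several metrics, its `score` method applies the scorer named in `refit`. The sketch below checks this on synthetic data and shows how accuracy and weighted F1 can be reported separately. The data from `make_classification` only stands in for the vectorized posts, and the `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the vectorized posts (an assumption)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

demo_scoring = {
    "accuracy": make_scorer(accuracy_score),
    "f1_score": make_scorer(f1_score, average="weighted"),
}
demo_search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]},
                           cv=5, scoring=demo_scoring, refit="f1_score")
demo_search.fit(X_tr, y_tr)

# Compute both metrics explicitly from the best model's predictions
y_pred = demo_search.predict(X_te)
test_f1 = f1_score(y_te, y_pred, average="weighted")
test_acc = accuracy_score(y_te, y_pred)

# score() applies the scorer named in `refit`, here the weighted F1
print(np.isclose(demo_search.score(X_te, y_te), test_f1))  # True
print(f"Test accuracy {test_acc:.2f}, test weighted F1 {test_f1:.2f}")
```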


### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

1. Because I am working with multiclass classification, I chose the F1 score with average='weighted'. This metric computes the F1 score for each class individually and then takes the average weighted by the number of samples in each class. Although the 20 Newsgroups dataset is not imbalanced, I decided to use the F1 score as good practice, since a high F1 score means that both precision and recall are high, whereas accuracy alone can be misleading on imbalanced datasets.
1. The test score of 0.98 is very high and matches the mean cross-validation score from Step 3, which is also 0.98. This means that the model generalizes well to new, unseen data and shows no sign of overfitting or underfitting.
1. Yes. The model has high precision and recall, and it classifies almost all of the posts correctly. I think the model is well suited to the real world as long as new posts follow a similar format and, of course, belong to one of the 3 categories. I do not have any specific suggestions on how to improve the analysis. I think that trying to improve this model is unnecessary because it is already performing quite well, and heavier tuning or more complex models could be a waste of resources without yielding any significant improvement. It could also be risky, as we might end up overfitting due to the increased model complexity. Finally, although more features can in general be beneficial for linear models, my personal opinion is that instead of trying to improve the model I would look into reducing the number of features even further. Note that I already tried to reduce the features by setting my TfidfVectorizer to max_features=5000, but my computer still took 4 minutes to run this notebook.
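
The weighted F1 score described in the first answer can be reproduced by hand, which makes the role of the class sizes explicit. This is a small sketch with made-up labels, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for 3 classes (0, 1, 2)
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 value per class
support = np.bincount(y_true)                          # number of true samples per class

# Weighted F1 = average of the per-class F1 values, weighted by class support
manual_weighted = np.average(per_class_f1, weights=support)
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")

print(per_class_f1, support)
print(manual_weighted, sklearn_weighted)
```

Here the per-class F1 values are 0.8, 0.8 and 1.0 with supports 3, 2 and 1, so both computations give 5/6 ≈ 0.83.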

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

1. I sourced my code from my Amazon review sentiment analysis project in PySpark for ENSF 612, specifically the text preprocessing part. I also sourced code from the pipeline examples in this course, specifically from the notebook Pipeline Steps.
1. I first looked into several datasets. Once I chose 20 Newsgroups, I explored the files, then I looked into how to adapt my PySpark code, which uses the sparknlp.base and sparknlp.annotator modules, to achieve the same result in regular Python with the sklearn.base and nltk modules. Finally, I reviewed the pipeline example notebook to complete the pipeline portion of the assignment.
1. I used several prompts for dataset brainstorming. I asked "give examples of cool datasets for text data classification". I also asked ChatGPT how to convert my existing PySpark code into regular Python to do all the preprocessing, and it suggested using the built-in functionality of TfidfVectorizer for all of it. Instead, I decided to create my own custom transformers to practice further.
1. Yes, I had several challenges. First, I struggled to get the 20 Newsgroups dataset because I was originally dealing with a huge .tar file, until I realized I could import it from sklearn. Second, I struggled a lot with nltk because it required many modules and external resources to work. My program also crashed many times and ran very slowly, so I had to reduce the number of features. Finally, building the pipeline and applying it with GridSearchCV was also a little challenging, but not as much as the preprocessing. A lot of patience, my previous practice with text data in PySpark, and the resources provided in this course helped me to be successful.

## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

I liked that we had the opportunity to work and learn on our own. I think I learnt a lot in this assignment, although it was also harder than the previous ones.
What I disliked is that it was hard for me to come up with my own dataset, as I am not very imaginative.

I found many things challenging while working with text data, but I also liked it a lot, and I would like to keep learning about AI and natural language processing.