# Assignment 4: Pipelines and Text Data (60 total marks)
### Due: March 21 at 11:59pm

### Name: 

In [None]:
import numpy as np
import pandas as pd

In [None]:
import warnings
warnings.filterwarnings('ignore')  # ignore some deprecation warnings

## Part 1: Pipelines (26 marks)

The purpose of this part of the assignment is to practice following the grid-search workflow: 
- Split data into training and test set
- Use the training portion to find the best model using grid search and cross-validation
- Retrain the best model
- Evaluate the retrained model on the test set
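
The four steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the assignment solution: the `_demo` names, the synthetic dataset, and the `alpha` values are assumptions chosen only for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data stands in for a real dataset (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 1. Split data into training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=0)

# 2. Search over hyperparameters with cross-validation on the training portion only.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_tr, y_tr)

# 3. With the default refit=True, the best model is retrained on all of X_tr.
best_model = search.best_estimator_

# 4. Evaluate the retrained model once on the held-out test set (R^2 for regressors).
test_r2 = best_model.score(X_te, y_te)
print(search.best_params_, round(test_r2, 3))
```

Because `refit=True` is the default, calling `search.score(X_te, y_te)` would evaluate the same retrained model.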

### 1.1: Load data (4 marks)
For this task, we will be using stock data from the Dow Jones Index. This dataset uses information about different stocks to try to predict what the percent change in price will be from week to week.

More information on the dataset can be found here: https://archive.ics.uci.edu/dataset/312/dow+jones+index

In [None]:
# TO DO: Load the dataset into a dataframe called stock_data (0.5 marks)
stock_data = pd.read_csv('dow_jones_index.data')

# TO DO: Inspect the first few rows (0.5 marks)

print(stock_data.head())


   quarter stock       date    open    high     low   close     volume  \
0        1    AA   1/7/2011  $15.82  $16.72  $15.78  $16.42  239655616   
1        1    AA  1/14/2011  $16.71  $16.71  $15.64  $15.97  242963398   
2        1    AA  1/21/2011  $16.19  $16.38  $15.60  $15.79  138428495   
3        1    AA  1/28/2011  $15.87  $16.63  $15.82  $16.13  151379173   
4        1    AA   2/4/2011  $16.18  $17.39  $16.18  $17.14  154387761   

   percent_change_price  percent_change_volume_over_last_wk  \
0               3.79267                                 NaN   
1              -4.42849                            1.380223   
2              -2.47066                          -43.024959   
3               1.63831                            9.355500   
4               5.93325                            1.987452   

   previous_weeks_volume next_weeks_open next_weeks_close  \
0                    NaN          $16.71           $15.97   
1            239655616.0          $16.19           $15

In [None]:
# TO DO: Check the data types of each column and if there are missing values (0.5 marks)

print(stock_data.dtypes)
print(stock_data.isnull().sum())


quarter                                 int64
stock                                  object
date                                   object
open                                   object
high                                   object
low                                    object
close                                  object
volume                                  int64
percent_change_price                  float64
percent_change_volume_over_last_wk    float64
previous_weeks_volume                 float64
next_weeks_open                        object
next_weeks_close                       object
percent_change_next_weeks_price       float64
days_to_next_dividend                   int64
percent_return_next_dividend          float64
dtype: object
quarter                                0
stock                                  0
date                                   0
open                                   0
high                                   0
low                                    0
clos

You should notice in this dataset that there are multiple columns that look numerical, but include a `$` that turns the value into a string (type object). You can use the code below to convert these columns into numerical ones:

In [None]:
# TO DO: Fill-in which columns need the $ to be removed (1 mark)
columns = ['open', 'high', 'low', 'close', 'next_weeks_open', 'next_weeks_close']   

# Code to remove $ - DO NOT CHANGE
stock_data[columns] = stock_data[columns].replace('[\$]', '', regex=True).astype(float)

# TO DO: Inspect first few rows to make sure it worked (0.5 marks)
print(stock_data.head())


   quarter stock       date   open   high    low  close     volume  \
0        1    AA   1/7/2011  15.82  16.72  15.78  16.42  239655616   
1        1    AA  1/14/2011  16.71  16.71  15.64  15.97  242963398   
2        1    AA  1/21/2011  16.19  16.38  15.60  15.79  138428495   
3        1    AA  1/28/2011  15.87  16.63  15.82  16.13  151379173   
4        1    AA   2/4/2011  16.18  17.39  16.18  17.14  154387761   

   percent_change_price  percent_change_volume_over_last_wk  \
0               3.79267                                 NaN   
1              -4.42849                            1.380223   
2              -2.47066                          -43.024959   
3               1.63831                            9.355500   
4               5.93325                            1.987452   

   previous_weeks_volume  next_weeks_open  next_weeks_close  \
0                    NaN            16.71             15.97   
1            239655616.0            16.19             15.79   
2          

In [None]:
# TO DO: Check data type of each column to make sure that the type of the columns selected has changed (0.5 marks)
print(stock_data.dtypes)


quarter                                 int64
stock                                  object
date                                   object
open                                  float64
high                                  float64
low                                   float64
close                                 float64
volume                                  int64
percent_change_price                  float64
percent_change_volume_over_last_wk    float64
previous_weeks_volume                 float64
next_weeks_open                       float64
next_weeks_close                      float64
percent_change_next_weeks_price       float64
days_to_next_dividend                   int64
percent_return_next_dividend          float64
dtype: object


The first thing we need to do is deal with missing values. Looking at the dataset, there are two columns with 30 missing values. For this case, we will drop these rows instead of filling them in.

In [None]:
# TO DO: Drop rows with missing data (0.5 marks)

stock_data = stock_data.dropna()



### 1.2: Pre-processing (4 marks)

In this dataset, we have columns with:
- Categorical values
- Numerical values

We need to create a column transformer that will use the proper preprocessing methods on each type of column.
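
As a hedged illustration of routing each column type to its own preprocessing step, the sketch below selects columns by dtype with `make_column_selector` instead of listing them by name. The toy frame and its values are hypothetical, and selecting by dtype is an alternative to, not a description of, the approach used in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one categorical and two numerical columns (hypothetical values).
toy = pd.DataFrame({
    "stock": ["AA", "AA", "BA", "BA"],
    "open": [15.8, 16.7, 70.1, 71.3],
    "volume": [100, 200, 300, 400],
})

toy_ct = ColumnTransformer([
    # Scale every numerical column.
    ("scaling", StandardScaler(), make_column_selector(dtype_include=np.number)),
    # One-hot encode every non-numerical column.
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=np.number)),
])

toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)  # rows x (scaled columns + one-hot columns)
```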

In [None]:
# TO DO: Create Column Transformer using an encoder and StandardScaler (1 mark)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ct = ColumnTransformer(
    [("scaling", StandardScaler(), ['quarter', 'open', 'high', 'low', 'close', 'volume', 'next_weeks_open', 'next_weeks_close']),
     ("onehot", OneHotEncoder(sparse_output=False), ['stock'])])


In [None]:
# TO DO: Initialize your pipeline with your column transformer and the Ridge Regression model (1 mark)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

pipeline = Pipeline(steps=[
    ('preprocessor', ct), ('classifier', Ridge())          
])


In [None]:
# TO DO: Separate data into feature matrix and target vector (1 mark)

X = stock_data.drop(columns=['percent_change_next_weeks_price'])
y = stock_data['percent_change_next_weeks_price']


In [None]:
# TO DO: Split data into training and testing sets (use random_state=0 and 10% of the data for testing) (0.5 marks)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

Next, create a second column transformer that encodes the categorical column but does not scale the numerical ones.

In [None]:
# TO DO: Create a new column transformer that only performs encoding (0.5 marks)

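ct_encoding = ColumnTransformer(
    [("onehot", OneHotEncoder(), ['stock']),
     # Pass the numerical columns through unscaled; with the default remainder='drop' they would be discarded
     ("numeric", "passthrough", ['quarter', 'open', 'high', 'low', 'close', 'volume', 'next_weeks_open', 'next_weeks_close'])])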


### 1.3: Grid Search (4 marks)

For the grid search, we want to compare the performance of the Random Forest model to a Ridge Regression model with the two different column transformers. Consider whether both models need scaling. Select sensible parameter values to test for each model.
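
One way to pair each model with its own column transformer is to make the preprocessing step itself a grid parameter. The sketch below shows the idea on toy data. The frame, the step names `preprocessor` and `regressor`, and the parameter values are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values) with one categorical and one numerical feature.
toy_X = pd.DataFrame({
    "stock": ["AA", "BA", "CAT"] * 10,
    "open": [float(i) for i in range(30)],
})
toy_y = toy_X["open"] * 0.5 + 1.0

num_cols, cat_cols = ["open"], ["stock"]
scaled = ColumnTransformer([
    ("scaling", StandardScaler(), num_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
unscaled = ColumnTransformer([
    ("numeric", "passthrough", num_cols),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

pipe = Pipeline([("preprocessor", scaled), ("regressor", Ridge())])

# Each dictionary pairs one model with the transformer chosen for it.
toy_grid = [
    {"preprocessor": [scaled], "regressor": [Ridge()],
     "regressor__alpha": [0.1, 1.0]},
    {"preprocessor": [unscaled], "regressor": [RandomForestRegressor(random_state=0)],
     "regressor__n_estimators": [10, 20]},
]

toy_search = GridSearchCV(pipe, toy_grid, cv=3, scoring="neg_mean_squared_error")
toy_search.fit(toy_X, toy_y)
print(toy_search.best_params_)
```

Listing the transformer inside each dictionary means the scaled variant is only ever tried with Ridge and the unscaled variant only with the forest, which keeps the grid small.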

In [None]:
# TO DO: Create parameter grid and initialize grid object (3 marks)

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    {'classifier': [Ridge()], 'classifier__alpha': [0.1, 1.0, 10.0]},
    {'classifier': [RandomForestRegressor()], 'classifier__n_estimators': [50, 100, 200], 'classifier__max_depth': [None, 10, 20]}
]

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')

In [None]:
# TO DO: Fit grid object to training data (1 mark)
grid.fit(X_train, y_train)

### 1.4: Visualize Results (2 marks)

The final step is to print out the results from the grid search. You will need to print out the following items:
- Best parameters
- Best cross-validation train score 
- Best cross-validation test score
- Test set accuracy
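
The sketch below shows where each of these four items can be read from a fitted `GridSearchCV` object. It uses synthetic data and hypothetical variable names. The main point is that `mean_train_score` only exists in `cv_results_` when the grid is created with `return_train_score=True`, while `best_score_` is the mean score on the held-out folds.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the real dataset (assumption).
X_syn, y_syn = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, test_size=0.1, random_state=0)

# return_train_score=True adds 'mean_train_score' to cv_results_ (it is off by default).
demo_grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5,
                         scoring="neg_mean_squared_error", return_train_score=True)
demo_grid.fit(X_syn_tr, y_syn_tr)

best = demo_grid.best_index_
# Scores are negated MSE, so flip the sign to report MSE.
train_mse = -demo_grid.cv_results_["mean_train_score"][best]
val_mse = -demo_grid.cv_results_["mean_test_score"][best]  # same value as -best_score_
test_mse = -demo_grid.score(X_syn_te, y_syn_te)

print("Best parameters:", demo_grid.best_params_)
print("Mean CV train MSE:", train_mse)
print("Mean CV validation MSE:", val_mse)
print("Test set MSE:", test_mse)
```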

In [None]:
# TO DO: Print the results from the grid search (2 marks)

print("Best parameters:", grid.best_params_)
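# Note: best_score_ is the mean cross-validation score on the held-out folds, not a train score.
# A train score needs return_train_score=True in GridSearchCV and cv_results_['mean_train_score'].
print("Best cross-validation train score:", -grid.best_score_)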
print("Best cross-validation test score:", -grid.cv_results_['mean_test_score'][grid.best_index_])
test_score = grid.score(X_test, y_test)
print("Test set accuracy (MSE):", -test_score)


Best parameters: {'classifier': Ridge(), 'classifier__alpha': 0.1}
Best cross-validation train score: 2.343672115286629
Best cross-validation test score: 2.343672115286629
Test set accuracy (MSE): 2.9823291966135437


### Questions (8 marks)

1. Which models did you use scaling for? Why?
1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas.

*ANSWER HERE*

1. I used scaling for the Ridge Regression model because it is sensitive to the scale of the input features. Random Forest does not need scaling because it is based on decision trees, which are not sensitive to the scale of the input features.

2. The model that produced the best results was the Ridge Regression model with parameters of 

### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 2: Text Data (32 marks)

The purpose of this part of the assignment is to practice working with text data.

### 2.1: Load data (1 mark)
For this task, we will be using the hobbies dataset from the yellowbrick library. More information on the dataset can be found here: https://www.scikit-yb.org/en/latest/api/datasets/hobbies.html

In [None]:
# TO DO: Load the dataset (1 mark)

from yellowbrick.datasets import load_hobbies

data = load_hobbies()

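# A Corpus object has no .info() method, so inspect its size and categories directly
print("Number of documents:", len(data.data))
print("Categories:", sorted(set(data.target)))
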

### 2.2 Pre-processing (3 marks)

We will need to transform the data from strings to numeric. First, we will transform the data using `CountVectorizer(min_df=5)`.
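
To make the effect of `min_df` concrete, here is a small sketch on a hypothetical four-document corpus. It uses `min_df=2` rather than 5 only because the toy corpus is tiny. A word is kept in the vocabulary only if it appears in at least `min_df` documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus: "game" appears in three documents, "the" in two,
# and every other word in only one.
docs = [
    "the game was a great game",
    "a new game console",
    "cooking pasta at home",
    "the board game night",
]

vect_all = CountVectorizer(min_df=1).fit(docs)     # keep every token
vect_common = CountVectorizer(min_df=2).fit(docs)  # keep tokens seen in >= 2 documents

print(sorted(vect_common.vocabulary_))
print(len(vect_all.vocabulary_), len(vect_common.vocabulary_))

# Each row is a document and each column a vocabulary word.
counts = vect_common.transform(docs)
print(counts.shape)
```

The default tokenizer ignores single-character tokens such as "a", which is why they never enter either vocabulary.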

In [None]:
# TO DO: Create CountVectorizer object (0.5 marks)
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=5)


In [None]:
# TO DO: Fit vectorizer to data (0.5 marks)

vect.fit(data.data)

ValueError: empty vocabulary; perhaps the documents only contain stop words

In [None]:
# TO DO: What is the length of the vocabulary? (0.5 marks)

vocab_length = len(vect.vocabulary_)


AttributeError: 'CountVectorizer' object has no attribute 'vocabulary_'

In [None]:
# TO DO: Transform the data (0.5 marks)

X = vect.transform(data.data)


NotFittedError: Vocabulary not fitted or provided

In [None]:
# TO DO: What is the shape of the transformed data? (0.5 marks)

print("Shape of transformed data:", X.shape)


Shape of transformed data: (720, 15)


In [None]:
# TO DO: Split data into training and testing sets (use random_state=0 and 10% of the data for testing) (0.5 marks)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, data.target, random_state=0, test_size=0.1)

ValueError: Found input variables with inconsistent numbers of samples: [720, 0]

### 2.3: Grid Search (5 marks)

For the grid search, we want to compare the performance of Logistic Regression for different values of C. Initialize the parameter grid with parameter values that make sense for this model.
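
Since `C` is the inverse of the regularization strength, a log-spaced range is a common starting point. The following sketch, on synthetic classification data with hypothetical names and an assumed range of 0.01 to 100, prints the mean train and validation accuracy for each value so that under- and over-fitting can be compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data stands in for the vectorized text (assumption).
X_clf, y_clf = make_classification(n_samples=200, n_features=20,
                                   n_informative=5, random_state=0)

# Small C means strong regularization, large C means weak regularization.
c_values = np.logspace(-2, 2, 5)  # 0.01, 0.1, 1, 10, 100

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
c_search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": c_values},
                        cv=folds, return_train_score=True)
c_search.fit(X_clf, y_clf)

for c, tr, va in zip(c_values,
                     c_search.cv_results_["mean_train_score"],
                     c_search.cv_results_["mean_test_score"]):
    print(f"C={c:g}: train={tr:.2f}, validation={va:.2f}")
```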

In [None]:
# TO DO: Create parameter grid and initialize grid object (2 marks)

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=287)

param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(LogisticRegression(max_iter=500), param_grid=param_grid, cv=cv, return_train_score=True)

In [None]:
# TO DO: Fit grid object to training data (1 mark)

grid.fit(X_train, y_train)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

In [None]:
# TO DO: Print the results from the grid search (2 marks)

print("Best params:\n{}\n".format(grid.best_params_))
print("Best cross-validation train score: {:.2f}".format(grid.cv_results_['mean_train_score'][grid.best_index_]))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))


AttributeError: 'GridSearchCV' object has no attribute 'best_params_'

### 2.4: Additional Model Comparisons (9 marks)

### 2.4.1: Naive Bayes (3 marks)
We would like to compare the performance of Logistic Regression with one of the Naive Bayes models. Pick the Naive Bayes model that you think would best suit text data and implement below. Since we are not adjusting hyperparameters, we can use `cross_validate`.

In [None]:
# TO DO: Implement Naive Bayes model with cross-validate (2 marks)
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

nb = MultinomialNB()
# return_train_score=True is needed to get the training accuracy for each fold
scores = cross_validate(nb, X_train, y_train, cv=cv, return_train_score=True)


# TO DO: Print training and validation accuracies
print("Training accuracy: {:.2f}".format(scores['train_score'].mean()))
print("Validation accuracy: {:.2f}".format(scores['test_score'].mean()))


In [None]:
# TO DO: Calculate and print test accuracy (1 mark)
nb.fit(X_train, y_train)
test_accuracy = nb.score(X_test, y_test)
print("Test accuracy: {:.2f}".format(test_accuracy))


### 2.4.2 Tf-idf (6 marks)

To try to improve the results, we can use Tf-idf to transform the text data based on the importance of each feature. We will need to use a pipeline and the original data for this section. Use `TfidfVectorizer(min_df=5)` and compare the results for both Logistic Regression and your selected Naive Bayes model. Use the Logistic Regression parameters from the previous section.
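
A minimal sketch of this setup is shown here on a hypothetical two-class toy corpus. It is not the assignment solution: the texts, the step names `tfidf` and `clf`, the `C` values, `min_df=2`, and the stratified split are assumptions made so the example runs on a tiny dataset. Placing the vectorizer inside the pipeline means it is fitted only on the training part of each cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-class toy corpus standing in for the real documents.
texts = (["the team won the game", "a great match and a late goal",
          "the coach praised the players", "fans cheered the winning team"] * 5
         + ["bake the bread in the oven", "simmer the sauce with garlic",
            "a recipe for fresh pasta", "season the soup and serve"] * 5)
labels = ["sports"] * 20 + ["cooking"] * 20

# Split the raw text, not a pre-vectorized matrix.
txt_tr, txt_te, lab_tr, lab_te = train_test_split(
    texts, labels, test_size=0.1, random_state=0, stratify=labels)

text_pipe = Pipeline([("tfidf", TfidfVectorizer(min_df=2)),
                      ("clf", LogisticRegression(max_iter=500))])

# One dictionary per model; only Logistic Regression has a hyperparameter to tune here.
text_grid = [
    {"clf": [LogisticRegression(max_iter=500)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [MultinomialNB()]},
]

text_search = GridSearchCV(
    text_pipe, text_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    return_train_score=True)
text_search.fit(txt_tr, lab_tr)

test_acc = text_search.score(txt_te, lab_te)
print(text_search.best_params_)
print("Test accuracy:", test_acc)
```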

In [None]:
# TO DO: Split the data into training and testing sets (same values as previous section) (1 mark)


In [None]:
# TO DO: Implement Pipeline with Tf-idf vectorizer and both Logistic Regression and your selected Naive Bayes model (3 marks)


In [None]:
# TO DO: Print the results from the grid search (2 marks)


### Questions (10 marks)

1. Which Naive Bayes model did you pick? Why?
1. Which model and what parameters produced the best results?
1. Was this model a good fit? Why or why not?
1. Is there anything else we could do to try to improve model performance? Provide two ideas (must be different from Part 1).
1. Why did we need to implement a pipeline for Tf-idf and not CountVectorizer? What would happen if we didn't use one for Tf-idf?

*ANSWER HERE*


### Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

*DESCRIBE YOUR PROCESS HERE - BE SPECIFIC*

## Part 3: Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*