#Preface

Updated notebook slightly. For challenge 2, in the past I resorted to Googling for answers. Yes, I confess, I do not have the entire pandas and numpy libraries memorized by heart. This year I decided to try Gemini instead. As noted, it can be quite powerful in that it can read the text and code cells of the notebook to help it formulate an answer. I was impressed.

<center>
<h1>Chapter Three</h1>
</center>

<hr>

## LEARNING OBJECTIVES:
- Start your own GitHub library to store useful functions, classes, etc., that you can use in future.
- Another look at feature engineering using Pearson correlations. Capture your work in a custom Transformer.

#I. Capture past work
<img src='https://www.dropbox.com/s/9fcc1crlxp19ijt/major_section.png?raw=1' width='300'>

I'd like to avoid redefining classes and other things at the top of each subsequent chapter. I'd like you to use GitHub as a place where you can save the work you do, week by week, and then load it back in future weeks.

In the past, I've asked students to build a full-blown library on GitHub and then use `import` to load it in. This year I would like to try something simpler (even if not as elegant).
Just use the `wget` command to download a Python file (script) from GitHub, then run it in the notebook.

##For the brave

I actually set up GitHub to publish my repository/library to PyPI. I used this guide: [GitHub to PyPI](https://packaging.python.org/en/latest/guides/publishing-package-distribution-releases-using-github-actions-ci-cd-workflows/). Then you can just do this in your notebook:

<pre>
!pip install mylibrary
import mylibrary
</pre>

That said, it took me half a day to debug the guide and get things set up. If anyone wants to try, I'll help as I can.

For now, let's just use a quick way to get library code loaded.

#II. Save a script that will load past work

Follow these steps along with me.

1. Go to GitHub and create a new repository. Call it what you want, e.g., `cis423`.

2. In your GitHub repository, create a new Python file. Call it `library.py` or something similar.

3. Paste this code into that file.
<pre>
from __future__ import annotations  #must be first line in your library!
import pandas as pd
import numpy as np
import types
from typing import Dict, Any, Optional, Union, List, Set, Hashable, Literal, Tuple, Self, Iterable
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import sklearn
sklearn.set_config(transform_output="pandas")  #says pass pandas tables through pipeline instead of numpy matrices
</pre>

4. Click Raw, then copy the URL.

5. Pull the user name, repository name, and file name out of that URL and use them to set the top 3 variables below.

6. Now run the cell below.





In [None]:
github_name = 'smith'  #fill in with your user name
repo_name = 'cis423'   #fill in with your repo name
source_file = 'library.py'  #fill in with file name

url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
#-f keeps rm quiet on the first run, when there is no file to remove yet
!rm -f $source_file
!wget $url
%run -i $source_file

The `wget` command will retrieve a file from a url and store it locally.

The `%run` command is a Jupyter (IPython) magic command. It allows you to run files from storage, local Colab storage in this case. The `-i` flag runs the file in the notebook's own namespace rather than in a fresh, empty one.
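
The three variables at the top of the cell above come straight out of the raw URL. As a rough sketch of how the pieces line up, the helper below (`split_raw_url` is a made-up name, not part of any library) splits a raw URL into user, repository, branch, and file path. It assumes the branch name contains no slashes and tolerates the `refs/heads/` form that GitHub sometimes puts in raw URLs.

```python
from urllib.parse import urlparse


def split_raw_url(raw_url: str) -> tuple[str, str, str, str]:
    """Split a raw.githubusercontent.com URL into (user, repo, branch, file path)."""
    parts = urlparse(raw_url).path.strip('/').split('/')
    user, repo, rest = parts[0], parts[1], parts[2:]
    # Some raw URLs carry 'refs/heads/' in front of the branch name; skip it.
    if rest[:2] == ['refs', 'heads']:
        rest = rest[2:]
    branch, file_path = rest[0], '/'.join(rest[1:])
    return user, repo, branch, file_path


pieces = split_raw_url('https://raw.githubusercontent.com/smith/cis423/main/library.py')
print(pieces)  # ('smith', 'cis423', 'main', 'library.py')
```

The first, second, and fourth items correspond to `github_name`, `repo_name`, and `source_file`.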

In [None]:
#Make sure we got what we wanted - these all should be defined vars now

annotations
pd
np
types
Dict, Any, Optional, Union, List, Set, Hashable, Literal, Tuple, Self, Iterable
BaseEstimator, TransformerMixin
Pipeline
sklearn.set_config

##Bring in titanic data (trimmed)

This is from chapter 2. I did what I promised. I downloaded the file to my computer and then uploaded it to a repository `course_datasets` on my GitHub account. I then chose the file and clicked the Raw button. Finally I copied the url and pasted it in below.

In [None]:
url = 'https://raw.githubusercontent.com/fickas/asynch_models/refs/heads/main/datasets/titanic_trimmed.csv'
titanic_table = pd.read_csv(url)

In [None]:
titanic_table.head()  #print first 5 rows of the table

###Produce feature columns

In [None]:
titanic_features = titanic_table.drop(columns='Survived')

#Challenge 1

Let's start building up your library. We will mostly be copying code over from chapter 2.

##Step 1.1

Add your transformer classes from chapter 2 to your library. A simple copy and paste will do. This should include:

* CustomMappingTransformer
* CustomOHETransformer
* CustomDropColumnsTransformer


##Step 1.2

Add your two pipelines from chapter 2:

* titanic_transformer = Pipeline(...)
* customer_transformer = Pipeline(...)

Note that while your transformers will not change over time (unless you find a bug in them!), you will continue to build up these pipelines over the next several chapters.

Now rerun your script.

In [None]:
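#url was reused for the Titanic csv above, so point it back at the library first
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
!rm -f $source_file
!wget $url
%run -i $source_file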

In [None]:
titanic_transformer  #should produce nice picture

In [None]:
customer_transformer  #should produce nice picture

###You can click on the boxes to get more information

Run the Titanic pipeline.

In [None]:
transformed_df = titanic_transformer.fit_transform(titanic_features)

In [None]:
transformed_df.head()

###What I see

|index|Age|Gender|Class|Married|Fare|Joined\_Belfast|Joined\_Cherbourg|Joined\_Queenstown|Joined\_Southampton|
|---|---|---|---|---|---|---|---|---|---|
|0|41\.0|0|1\.0|0\.0|7\.0|0|0|0|1|
|1|21\.0|0|0\.0|0\.0|0\.0|0|0|0|1|
|2|13\.0|0|1\.0|NaN|20\.0|0|0|0|1|
|3|16\.0|0|1\.0|0\.0|NaN|0|0|0|1|
|4|NaN|0|2\.0|0\.0|24\.0|0|1|0|0|

#Challenge 2

In a chapter 1 challenge, we looked briefly at a method for computing column (feature) correlations. In this challenge, I'd like to look at another, called the Pearson correlation coefficient (PCC). You can read about it at the link below. It is probably good to know the basics: it is the kind of question that could come up in a job interview. [Pearson CC](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).

Good news:
pandas has a method for computing a pairwise PCC for a dataframe and then putting that in a table. Check it out.

In [None]:
df_corr = transformed_df.corr(method='pearson')
df_corr

A cell value is in the range -1 (perfectly reversed/negatively correlated) to 1 (perfectly positively correlated). Looking at the table above I can see that there is a value of `0.3896` between `Fare` and `Class`. Is this high? It is a subjective question.
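
To make the number less of a black box, here is a small sketch that computes one Pearson coefficient by hand and compares it with what pandas reports. The table `toy` and its columns `x` and `y` are made up for illustration and have nothing to do with the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

# Pearson r = sum of products of centered values,
# divided by the square root of the product of the sums of squares.
x_centered = toy['x'] - toy['x'].mean()
y_centered = toy['y'] - toy['y'].mean()
r_by_hand = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same value, read out of the pandas correlation table.
r_pandas = toy.corr(method='pearson').loc['x', 'y']

print(round(r_by_hand, 4), round(r_pandas, 4))  # 0.7746 0.7746
```

Both routes give 6 / sqrt(60), roughly 0.7746, for this toy table.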

##pandas does not complete the process

The question is what to do with this correlation information. I'm going to suggest that if columns C1 and C2 are correlated above some threshold, we drop C2. We don't need it. We can get by with just C1. This is called feature reduction.

The bad news is that pandas does not give us a way to get from the `corr` table to dropping columns. That's where you come in. Given the `corr` table above, I'll step you through getting to a place where we can drop columns. There are many ways to do this. I am using the one that makes the most sense to me.



##Step 1.1 Create a Boolean table

Each cell is `True` if the absolute value of its correlation is greater than a threshold and `False` if not.

Your choice of threshold is, of course, critical. I chose `.4` because it gives us a column to drop. But it is probably too low in general.

Here is your target. You can double check with the `corr` table above to verify that cells with `True` have `abs(PCC)>.4`. Remember that we also want a `True` for values `< -.4`: highly positive and highly negative correlations are both reasons for concern.

|index|Age|Gender|Class|Married|Fare|Joined\_Belfast|Joined\_Cherbourg|Joined\_Queenstown|Joined\_Southampton|
|---|---|---|---|---|---|---|---|---|---|
|Age|true|false|false|false|false|false|false|false|false|
|Gender|false|true|false|false|false|false|false|false|false|
|Class|false|false|true|false|false|false|false|false|false|
|Married|false|false|false|true|false|false|false|false|false|
|Fare|false|false|false|false|true|false|false|false|false|
|Joined\_Belfast|false|false|false|false|false|true|false|false|false|
|Joined\_Cherbourg|false|false|false|false|false|false|true|false|true|
|Joined\_Queenstown|false|false|false|false|false|false|false|true|false|
|Joined\_Southampton|false|false|false|false|false|false|true|false|true|
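
As a sketch of the general idea on a made-up table (the names `toy`, `toy_threshold`, and `toy_mask` are illustrative, and this is only one of several ways to do it), taking absolute values and then comparing against a threshold produces a True/False table with no loops. Column `b` is deliberately built to be perfectly negatively correlated with `a`, to show that the absolute value catches it.

```python
import pandas as pd

# Made-up data: b falls exactly as a rises; c is only weakly related to either.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [10, 8, 6, 4, 2],
    'c': [3, 1, 4, 1, 5],
})
toy_threshold = 0.9

# .abs() maps -1 to 1, so strong negative correlations also pass the test.
toy_mask = toy.corr(method='pearson').abs() > toy_threshold
print(toy_mask)
```

In the result, the `a`/`b` cells are `True` even though their correlation is negative, the cells involving `c` are `False` apart from its own diagonal cell, and the diagonal is `True` throughout.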

In [None]:
threshold = .4

In [None]:
#Hint: there is a way to create a new table with only absolute values. No loops needed.
#And there is a way to create a True/False table based on a condition. No loops needed.
#And you can do it in one line if you feel brave!

masked_df =
masked_df

###What I get

|index|Age|Gender|Class|Married|Fare|Joined\_Belfast|Joined\_Cherbourg|Joined\_Queenstown|Joined\_Southampton|
|---|---|---|---|---|---|---|---|---|---|
|Age|true|false|false|false|false|false|false|false|false|
|Gender|false|true|false|false|false|false|false|false|false|
|Class|false|false|true|false|false|false|false|false|false|
|Married|false|false|false|true|false|false|false|false|false|
|Fare|false|false|false|false|true|false|false|false|false|
|Joined\_Belfast|false|false|false|false|false|true|false|false|false|
|Joined\_Cherbourg|false|false|false|false|false|false|true|false|true|
|Joined\_Queenstown|false|false|false|false|false|false|false|true|false|
|Joined\_Southampton|false|false|false|false|false|false|true|false|true|

##Step 1.2 Mask off bottom triangle

The table is symmetrical. I only need to work on one half of it. I've chosen to work on upper half (triangle) somewhat arbitrarily. So I want to change all values below the diagonal to `False`.

And oh, I want to change the diagonal, itself, to `False`. It has PCC values of 1 (True) that are spurious: every column is perfectly correlated with itself.

Numpy has a nice method for doing what I want. I found it by asking Gemini for help.

Here is your target.

<pre>
array([[False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False,  True], #one True here
       [False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False]])
</pre>
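
One NumPy function that fits this description is `np.triu`. Whether it is the one the author had in mind is an assumption. The sketch below applies it to a small made-up Boolean matrix (`toy_mask` is illustrative); passing `k=1` keeps only the cells strictly above the diagonal, so the diagonal is cleared along with the lower triangle.

```python
import numpy as np

# Made-up symmetric Boolean matrix with a True diagonal.
toy_mask = np.array([[True,  True,  False],
                     [True,  True,  False],
                     [False, False, True]])

# k=1 starts one step above the main diagonal; everything else becomes False.
toy_upper = np.triu(toy_mask, k=1)
print(toy_upper)
```

Only the cell in row 0, column 1 survives as `True`. With a pandas table, passing it straight to `np.triu` should give back a plain NumPy array, which matches the `array([...])` form of the target above.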

In [None]:
upper_mask =
upper_mask

##Step 1.3 Find correlated columns

I'll look at each column of the masked matrix. If it has any `True` values, then it is correlated with one or more columns to its left. I want to drop it.

You can eyeball what I have above to see that the `Joined_Southampton` column has a `True` value. So that is what I expect to find.
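
A small sketch of the column scan on made-up inputs (`toy_columns` and `toy_upper` are illustrative names). Because only cells above the diagonal can be `True`, a `True` in column j always sits in a row i < j, so the flagged column is the later one of the correlated pair and the earlier one is kept.

```python
import numpy as np

toy_columns = ['a', 'b', 'c']
# Made-up upper-triangle mask: a and b are flagged as correlated.
toy_upper = np.array([[False, True,  False],
                      [False, False, False],
                      [False, False, False]])

# enumerate gives the position of each column name;
# np.any checks that column of the matrix for at least one True.
toy_correlated = [name for i, name in enumerate(toy_columns) if np.any(toy_upper[:, i])]
print(toy_correlated)  # ['b']
```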

In [None]:
#I used a list comprehension with enumerate in the generator. It gave me both a column name and its index in upper_mask.
#numpy has a method for looking for any True values in a matrix column.

correlated_columns =

correlated_columns  #['Joined_Southampton']

##Step 1.4 Drop correlated column(s)

We have seen how to do this before.

In [None]:
new_df =

In [None]:
set(transformed_df.columns) - set(new_df.columns)  #{'Joined_Southampton'}

##Caveat 1

The question of correlations is a deep one and we are just scratching the surface. Many popular methods do not simply drop columns, they build a brand new set of reduced columns. For your interview question prep, you should probably at least look at linear algebra methods such as PCA and SVD.
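
For a first taste of what "building a brand new set of reduced columns" means, here is a minimal PCA-via-SVD sketch in plain NumPy on synthetic data. Everything in it (the random data, `n_keep`, the variable names) is an assumption made for illustration, and it glosses over scaling and other practical concerns.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=100)

# Synthetic table: column 1 is nearly a copy of column 0 (times 2), column 2 is independent.
X = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

# PCA: center the columns, then take the SVD of the centered matrix.
X_centered = X - X.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance carried by each principal direction (largest first).
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)

# Keep the top directions and project the data onto them.
n_keep = 2
X_reduced = X_centered @ Vt[:n_keep].T

print(X_reduced.shape)  # (100, 2)
```

Unlike dropping a column, the two new columns are mixtures of all three originals, which is why they are harder to interpret.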

##Caveat 2

It would be interesting to compare `dcor` from chapter 1 with PCC. I would expect dcor to be better. The problem is that it will not work with NaN values, so we can't do it here.



##Caveat 3

One reason people look at correlations and feature reduction is to give machine learning models a less complex dataset. I am on the edge with this. It is true that in some cases reducing the number of columns can help a machine learning model learn. But in other cases, the model will learn to ignore columns that do not supply useful information. In the latter case, you could be wasting considerable time with a task that is unnecessary.

It is also the case that there are automated tools for doing feature reduction, taking the burden off of you. If we have time, we might discuss a few later in the course.


#Challenge 3

Build a CustomPearsonTransformer. Plug your code from Challenge 2 into it, split between the `fit` and `transform` methods as described below.



##A Prelude to your class

Up until now the `fit` method has done nothing and the `transform` method has done everything. I'd like you to change things up. Please have the `fit` method do most of the work. In particular, it computes the list of columns to drop (but does not drop them). The `transform` method simply drops the list of columns computed by `fit`.

Note that you will have to think a bit about how to tell whether `transform` has been called before `fit`. I want an assertion error in that case.
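
To show the fit-then-transform split without giving away the challenge, here is a sketch of the same pattern on a different, deliberately trivial problem: dropping columns that hold a single value. The class name `DropConstantColumns` and its attribute are made up for illustration. Initializing the attribute to `None` and checking it in `transform` is one possible way to detect a missing `fit`, not necessarily the intended one.

```python
from __future__ import annotations
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropConstantColumns(BaseEstimator, TransformerMixin):
    """Toy transformer: fit finds single-valued columns, transform drops them."""

    def __init__(self) -> None:
        self.constant_columns = None  # stays None until fit has run

    def fit(self, X: pd.DataFrame, y=None) -> DropConstantColumns:
        # All the "learning" happens here; nothing is dropped yet.
        self.constant_columns = [c for c in X.columns if X[c].nunique(dropna=False) <= 1]
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        assert self.constant_columns is not None, (
            f'{self.__class__.__name__}.transform called before fit.'
        )
        # Only apply what fit already decided.
        return X.drop(columns=self.constant_columns)


toy = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7]})
dropper = DropConstantColumns()

try:
    dropper.transform(toy)  # fit has not run yet
except AssertionError as err:
    message = str(err)

result = dropper.fit_transform(toy)  # fit_transform comes from TransformerMixin
print(message)
print(list(result.columns))  # ['a']
```

A side effect of this split is that the column list is decided once, on the table passed to `fit`, and then applied unchanged to any later table passed to `transform`.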

In [None]:
class CustomPearsonTransformer(BaseEstimator, TransformerMixin):
    """
    A custom scikit-learn transformer that removes highly correlated features
    based on Pearson correlation.

    Parameters
    ----------
    threshold : float
        The correlation threshold above which features are considered too highly correlated
        and will be removed.

    Attributes
    ----------
    correlated_columns : Optional[List[Hashable]]
        A list of column names (which can be strings, integers, or other hashable types)
        that are identified as highly correlated and will be removed.
    """


In [None]:
#test it out
pt = CustomPearsonTransformer(.4)

In [None]:
new_df = pt.transform(transformed_df)  #AssertionError: PearsonTransformer.transform called before fit.

In [None]:
new_df = pt.fit_transform(transformed_df)  #list of columns to drop

In [None]:
set(transformed_df.columns) - set(new_df.columns)  #{'Joined_Southampton'}

###Try with lower threshold

In [None]:
#test it out
pt = CustomPearsonTransformer(.35)
new_df = pt.fit_transform(transformed_df)
set(transformed_df.columns) - set(new_df.columns)  #{'Fare', 'Joined_Southampton'}

##You can add this transformer to your library if you want.

I don't see us using it in future but good to have around.

To repeat caveat 3: my preference is to let (a) the models themselves, or (b) tools built specifically for this job, sort out which columns are important and which are not.

###A few follow-up methods for feature reduction

https://medium.com/@sumantabasak/know-these-already-few-powerful-feature-selection-algorithms-18a5a27c1cd3

https://towardsdatascience.com/boruta-and-shap-for-better-feature-selection-20ea97595f4a

The second link in particular talks about tools built just for the job, e.g., SHAP.