# Announcement

I made a big switch here. In the video I use the Pima dataset from Midterm 1, but I decided it would be too confusing to start working with that dataset during the final project. It seemed more consistent to stick with Titanic, so that is what I did. The steps are the same, with the caveat below.

I made a change to *Feature reduction*. I have given you a new method (from a new library) that is more useful than the pandas `.corr` method.


I also removed the need to save the test set. We really do not need it.

Finally, I am asking you to build a markdown (md) description of your pipeline, including a screenshot of it. Refer back to Chapter 7 for a model of what you need. You may be able to copy and paste what I have in Chapter 7 and then edit it to fit your new pipeline.

<center>
<h1>First Notebook for Final Project (Wrangling)</h1>
</center>

<hr>

I am using the Titanic dataset. You will need to fill in your own, so replace my csv, random_state and pipeline with yours.

Reminder: here is a list of places you can find data to use: https://careerfoundry.com/en/blog/data-analytics/where-to-find-free-datasets/.

# Notes on your choice of dataset

* I would like you to find a csv dataset that has a mixture of numeric and categorical columns, so you can show off your transformers!

* I'd like a binary label column, i.e., what you will predict.

* Do not worry about the number of rows. If there are fewer than 500, you can use SMOTE to increase them. If there are more than 5000, you can downsample to decrease them.

* Don't worry about the number of feature columns. I can help you reduce a large number down to 10 or fewer.


# I'd like to know the dataset you choose. Email me a link.

Do this before starting this notebook. I may have some guidance that will help you and steer you away from dead ends.

## Set-up

First bring in your library.

In [None]:
github_name = 'marvnc'
repo_name = 'cs523'
source_file = 'library.py'
url = f'https://raw.githubusercontent.com/{github_name}/{repo_name}/main/{source_file}'
!rm -f $source_file
!wget $url
%run -i $source_file

## Demo with Titanic

## Caveat

You will not match my results. You have your own dataset. I am just giving you the outline of the steps I expect you to follow.

But you will need your dataset stored in a way that you can load it (and I can load it). This may mean downloading it from the web (e.g., Kaggle) and then uploading it to GitHub.

In [None]:
url = 'https://raw.githubusercontent.com/fickas/asynch_models/main/datasets/titanic_trimmed.csv'  #trimmed version

titanic_trimmed = pd.read_csv(url)
titanic_trimmed.head()


In [None]:
len(titanic_trimmed)

# Break out into features and labels



In [None]:
titanic_features = titanic_trimmed.drop(columns='Survived')
labels = titanic_trimmed['Survived'].to_list()

In [None]:
labels.count(1)/len(labels)

# Downsampling (optional)

Here is code to reduce the number of rows while keeping the new table stratified on the target column.

You can skip this step if your table has fewer than 5000 rows. You could probably skip it with fewer than 10000. The whole point is to avoid tuning (next notebook) that takes hours. In a real setting, you would probably have to bite the bullet and wait that long.

Note that I am setting N=1000 just for demo purposes. You should use 5000 or even greater as your value if you do need to downsample.

In [None]:
N=1000  #size of table you want, really not needed for such a small table as Titanic
target = 'Survived'
original_df = titanic_trimmed.copy()

downsample_df = original_df.groupby(target, group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(original_df))))).sample(frac=1).reset_index(drop=True)
downsample_df

In [None]:
downsample_df['Survived'].to_list().count(1)/len(downsample_df)  #same as original so seems to work

### For those who want to know

Here is a brief explanation:

* `original_df.groupby(target, group_keys=False)`: This groups the dataframe by the target column. The `group_keys=False` keeps the group labels from being added as an extra index level when `apply` combines the per-group results.

* The `apply` method applies a function to each group:

  * `lambda x: x.sample(int(np.rint(N*len(x)/len(original_df))))`: This function samples from each group. The number of rows taken is the group's share of the original dataframe multiplied by N (the number of rows you want overall).
  * The `np.rint` function rounds that number to the nearest integer.

* `sample(frac=1)`: This shuffles the rows of the combined dataframe.

* `reset_index(drop=True)`: This resets the index of the dataframe and drops the original index.

In essence, the code calculates the number of samples needed from each group directly, based on the proportion of the group in the original dataframe, and then samples that number of rows from each group. It then shuffles the combined downsampled dataframe.
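
To see the arithmetic on something small, here is a minimal sketch on a made-up table (`toy`, with 700 rows of label 0 and 300 rows of label 1; it is not Titanic). It assumes `pandas` and `numpy` are installed. The helper name `stratified_downsample` and its `seed` argument are hypothetical, added only for illustration. It loops over the groups instead of calling `apply`, but it computes the per-group sample size the same way as the one-liner.

```python
import numpy as np
import pandas as pd


def stratified_downsample(df, target, n, seed=0):
    """Hypothetical helper: keep about n rows, preserving the target proportions."""
    parts = []
    for _, group in df.groupby(target):
        # Rows to keep from this group = its share of the table times n.
        k = int(np.rint(n * len(group) / len(df)))
        parts.append(group.sample(k, random_state=seed))
    # Combine the groups, shuffle the rows, and renumber the index.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up data: 30% of the rows have label 1.
toy = pd.DataFrame({'x': range(1000), 'label': [1] * 300 + [0] * 700})
small = stratified_downsample(toy, 'label', n=100)
print(len(small), small['label'].mean())
```

With `n=100` the sketch keeps 70 rows of label 0 and 30 rows of label 1, so the 30% positive rate is unchanged.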

# Feature reduction (optional)

Some of you will need to drop columns. I'll give you a new library for this. I like it because it works on a table with a mixture of data types, e.g., categorical columns.

You should be able to use `correlation_df` (below) in place of the code you worked on earlier that used the pandas correlation table.

You can play around with the threshold to get to your target number of features.



### New correlation method

Calling `titanic_trimmed.corr()` will give you an error if you have mixed column types. This new method handles them fine. When comparing two columns, it uses:

* Cramér's V for categorical-categorical.
* correlation ratio for numeric-categorical.
* Pearson for numeric-numeric.
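
If you are curious what these three measures compute, here is a rough sketch using the textbook formulas on tiny made-up lists. The function names `cramers_v` and `correlation_ratio` are my own for this sketch, and it assumes `numpy` and `pandas` are installed. The library's own implementation may differ in details (for example, it may apply a correction to Cramér's V), so treat the numbers as illustrative rather than as what `associations` will return.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Uncorrected Cramer's V. Assumes each input has at least two categories."""
    observed = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy()
    n = observed.sum()
    # Counts we would expect if the two columns were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))


def correlation_ratio(categories, values):
    """Correlation ratio (eta). Assumes the numeric values are not all equal."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    # Spread of the group means around the overall mean, weighted by group size.
    ss_between = sum(
        (categories == c).sum() * (values[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))


# Made-up examples: one perfect association and one with no association.
v_perfect = cramers_v(['a', 'a', 'b', 'b'], ['u', 'u', 'v', 'v'])
v_none = cramers_v(['a', 'a', 'b', 'b'], ['u', 'v', 'u', 'v'])
eta_perfect = correlation_ratio(['a', 'a', 'b', 'b'], [1, 1, 3, 3])
eta_none = correlation_ratio(['a', 'a', 'b', 'b'], [1, 3, 1, 3])
pearson = float(np.corrcoef([1, 2, 3, 4], [2, 4, 6, 8])[0, 1])
print(v_perfect, v_none, eta_perfect, eta_none, pearson)
```

Cramér's V and the correlation ratio run from 0 (no association) to 1 (one column fully determines the other). Pearson runs from -1 to 1.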

In [None]:
!pip install dython
from dython.nominal import associations

# Simple one-liner to get correlations:
correlations = associations(titanic_trimmed, nominal_columns=['Joined', 'Gender', 'Class'])  #you have to tell it which are categorical
correlations

In [None]:
# Convert to dataframe
correlation_df = correlations['corr']
correlation_df
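
Here is one possible way to turn a correlation table plus a threshold into a list of columns to drop. This is a sketch on a made-up 3x3 table, not the Titanic result, and it may differ from the code you wrote earlier. The helper name `columns_to_drop` is hypothetical. It assumes a square, symmetric dataframe like `correlation_df`, with `numpy` and `pandas` installed.

```python
import numpy as np
import pandas as pd


def columns_to_drop(corr_df, threshold):
    """Hypothetical helper: list columns that correlate above threshold with an earlier column."""
    abs_corr = corr_df.abs()
    # Keep only the part above the diagonal so each pair is checked once.
    mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    upper = abs_corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Made-up correlation table: 'a' and 'b' are strongly related.
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.1],
     [0.9, 1.0, 0.2],
     [0.1, 0.2, 1.0]],
    index=['a', 'b', 'c'],
    columns=['a', 'b', 'c'],
)
print(columns_to_drop(toy_corr, threshold=0.8))
```

Lowering the threshold drops more columns, and raising it drops fewer. If your table still includes the label column, leave it out of the correlation table (or out of the drop list) so you do not drop what you are trying to predict.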

## If you downsample and/or do feature reduction

Save the resulting table out to GitHub so you do not have to keep repeating the process.
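
A minimal sketch of the save step, assuming you write a csv locally and then upload it to GitHub by hand. The table `reduced_df` and the file name `my_reduced_table.csv` are made-up stand-ins for your own. The sketch writes to a temporary folder only so that it is self-contained. In your notebook you would write to the working directory instead.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for your downsampled and/or reduced table.
reduced_df = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Gender': ['Male', 'Female', 'Female'],
    'Survived': [0, 1, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_reduced_table.csv')
    # index=False keeps the row index from being written as an extra column.
    reduced_df.to_csv(path, index=False)
    # Read it back to check that nothing was lost.
    reloaded = pd.read_csv(path)

print(reloaded.shape, list(reloaded.columns))
```

Once the file is on GitHub, you can load it with `pd.read_csv` and a raw url, the same way the Titanic table is loaded above.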


# Define your pipeline

This is up to you for your own dataset.


In [None]:

titanic_transformer = Pipeline(steps=[
    ('map_gender', CustomMappingTransformer('Gender', {'Male': 0, 'Female': 1})),
    ('map_class', CustomMappingTransformer('Class', {'Crew': 0, 'C3': 1, 'C2': 2, 'C1': 3})),
    ('target_joined', CustomTargetTransformer(col='Joined', smoothing=10)),
    ('tukey_age', CustomTukeyTransformer(target_column='Age', fence='outer')),
    ('tukey_fare', CustomTukeyTransformer(target_column='Fare', fence='outer')),
    ('scale_age', CustomRobustTransformer(target_column='Age')),
    ('scale_fare', CustomRobustTransformer(target_column='Fare')),
    ('impute', CustomKNNTransformer(n_neighbors=5)),
    ], verbose=True)


## Take a screenshot of your pipeline

Save it on GitHub. Then use the Chapter 7 markdown example to build your own pipeline description, and save that on GitHub as well.

## Find random state for splitting



In [None]:
titanic_transformed_df = titanic_transformer.fit_transform(titanic_features, labels)

In [None]:
titanic_transformed_df.head()

# Find random_state value



In [None]:
%%capture
rs, _ = find_random_state(titanic_transformed_df, labels, titanic_transformer)

In [None]:
rs  #whatever it is for your own dataset

### Remember this value for next notebook, tuning

## Split your dataset and save fitted pipeline

In [None]:
X_train, X_test, y_train, y_test = train_test_split(titanic_features, labels, test_size=0.2, random_state=rs)
production_pipeline = titanic_transformer.fit(X_train, y_train)  #now ready for production

In [None]:
import joblib

joblib.dump(production_pipeline, "final_fully_fitted_pipeline.pkl")  #Move this to GitHub where you will use it in production

In [None]:
ptransformer = joblib.load("final_fully_fitted_pipeline.pkl")  # Make sure you can get it back
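
If you want more than a successful load as evidence, one option is to check that the reloaded pipeline transforms data exactly like the one you saved. The sketch below uses a tiny stand-in pipeline (`toy_pipeline`, a single `StandardScaler` step) and made-up numbers instead of the Titanic pipeline, so it runs on its own. It assumes `joblib`, `numpy` and `scikit-learn` are installed.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data and a stand-in pipeline, fitted just like production_pipeline.
toy_X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_pipeline = Pipeline(steps=[('scale', StandardScaler())]).fit(toy_X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(toy_pipeline, path)
    reloaded = joblib.load(path)

# The fitted state survived the round trip if both give the same output.
same_output = np.allclose(toy_pipeline.transform(toy_X), reloaded.transform(toy_X))
print(same_output)
```

For your own pipeline, the equivalent check would compare `production_pipeline.transform(X_test)` with `ptransformer.transform(X_test)`.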


# Congrats

You are done with the wrangling stage and ready to move on to training and tuning models.

# Reminder

You should have the following files stored on GitHub.

1. Screenshot of pipeline.

2. Markdown pipeline description that references screenshot.

3. `final_fully_fitted_pipeline.pkl`

4. Value for `rs` to use in `train_test_split`. Saving this to GitHub is optional. You can just remember it and use it in the tuning notebook.
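
If you would rather not rely on memory for item 4, a small file works. This is only a sketch: the file name `random_state.json` is made up, and `42` is a placeholder for whatever `find_random_state` gave you. It uses a temporary folder so it runs on its own. In practice you would write the file next to your notebook and upload it to GitHub.

```python
import json
import os
import tempfile

rs = 42  # placeholder; use the value you found for your own dataset

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'random_state.json')
    # Save the value under a descriptive key.
    with open(path, 'w') as f:
        json.dump({'random_state': rs}, f)
    # Read it back, as you would at the top of the tuning notebook.
    with open(path) as f:
        restored_rs = json.load(f)['random_state']

print(restored_rs == rs)
```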