<center>
    <h1 id='data-science-tricks-2' style='color:#7159c1'>🎩 Data Science Tricks 2 🎩</h1>
    <i>Getting better visualization insights and transformations of your dataset</i>
</center>

```txt
- Feature Selection
- Pandas in Parallel
```

In [1]:
# ---- Imports ----
import pandas as pd

# ---- Constants ----
DATASETS_PATH = './datasets'
SEED = 20240707

<p id='0-feature-selection' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Feature Selection</p>

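`SelectFromModel` is a scikit-learn meta-transformer that selects the most important features for your model. It fits a base model, reads each feature's importance from it (`coef_` or `feature_importances_`), and keeps only the features whose importance reaches a threshold.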

Documentation: [SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html).

In [2]:
# ---- Reading Dataset ----
titanic_df = pd.read_csv(f'{DATASETS_PATH}/titanic.csv')

print(f'- Observations: {titanic_df.shape[0]:,}')
print(f'- Variables: {titanic_df.shape[1]:,}')
print('---')

titanic_df.head()

- Observations: 891
- Variables: 25
---


Unnamed: 0,Survived,Age,Fare,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male,SibSp_0,SibSp_1,...,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6,Embarked_C,Embarked_Q,Embarked_S
0,0,22.0,7.25,0,0,1,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
1,1,38.0,71.2833,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,1,0,0
2,1,26.0,7.925,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
3,1,35.0,53.1,1,0,0,1,0,0,1,...,1,0,0,0,0,0,0,0,0,1
4,0,35.0,8.05,0,0,1,0,1,1,0,...,1,0,0,0,0,0,0,0,0,1


In [3]:
# ---- Splitting Dataset into Train and Validation ----
from sklearn.model_selection import train_test_split

X = titanic_df.drop(['Survived'], axis=1).copy() # axis '0': by row; axis '1': by column
y = titanic_df['Survived'].copy()

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y
    , train_size=0.80
    , test_size=0.20
    , stratify=y
    , random_state=SEED
)

print(f'- Y Train:\n{y_train.value_counts(normalize=True)}') # normalize 'True' returns proportions, 'False' returns absolute counts
print('\n---\n')
print(f'- Y Valid:\n{y_valid.value_counts(normalize=True)}')

- Y Train:
0    0.616573
1    0.383427
Name: Survived, dtype: float64

---

- Y Valid:
0    0.614525
1    0.385475
Name: Survived, dtype: float64


In [4]:
# ---- Feature Selection ----
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

base_model = LassoCV(random_state=SEED)

select_from_model = SelectFromModel(base_model, threshold='mean')
select_from_model.fit(X_train, y_train)

number_selected_features = select_from_model.transform(X_train).shape[1]
indices_selected_features = select_from_model.get_support(indices=True)
names_selected_features = select_from_model.get_feature_names_out()

print(f'- Number of Selected Features: {number_selected_features}')
print(f'- Indices of Selected Features: {indices_selected_features}')
print(f'- Names of Selected Features: {names_selected_features}')

- Number of Selected Features: 4
- Indices of Selected Features: [ 2  4  5 23]
- Names of Selected Features: ['Pclass_1' 'Pclass_3' 'Sex_female' 'Embarked_S']
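
To make the `threshold='mean'` setting used above more concrete, here is a minimal sketch on synthetic data. The dataset and the `X_demo`, `y_demo`, `selector` and `manual_mask` names are assumptions made for illustration and are not part of the Titanic example. It rebuilds the selection by hand: a feature is kept when the absolute value of its Lasso coefficient is at least the mean absolute coefficient.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (hypothetical, not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

selector = SelectFromModel(LassoCV(random_state=0), threshold='mean')
selector.fit(X_demo, y_demo)

# Importance of each feature: absolute value of its fitted Lasso coefficient
importances = np.abs(selector.estimator_.coef_)

# threshold='mean' keeps the features whose importance reaches the mean importance
manual_mask = importances >= importances.mean()

print(f'- Threshold: {selector.threshold_:.4f}')
print(f'- Same selection as SelectFromModel: {np.array_equal(manual_mask, selector.get_support())}')
```

Other strings such as `'median'`, or a scaled form like `'1.25*mean'`, can be passed as `threshold` when a looser or stricter cut is wanted.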


<p id='1-pandas-in-parallel' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Pandas in Parallel</p>

`Pandarallel` is a package that runs pandas operations such as `apply` in parallel across your CPU cores. Let's see how it works and how much time it saves.

Documentation: [pandarallel 1.6.5](https://pypi.org/project/pandarallel/).

Important for Windows users: [docs/examples_windows.ipynb](https://github.com/nalepae/pandarallel/blob/master/docs/examples_windows.ipynb).
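
Conceptually, a parallel `apply` follows a split, apply, concatenate pattern: the rows are divided into one chunk per worker, each worker applies the function to its chunk, and the partial results are joined back together. The sketch below only illustrates that pattern and is not pandarallel's implementation. It assumes a hypothetical `row_function`, a tiny `demo_df`, and uses threads as a portable stand-in for the worker processes pandarallel starts, so it shows the mechanics rather than a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def row_function(row):
    # Hypothetical row-wise transformation used only for this sketch
    return np.sin(row.A**2) + np.sin(row.B**2)

demo_df = pd.DataFrame({'A': range(8), 'B': range(8, 16)})
n_workers = 4  # hypothetical worker count

# 1. Split the rows into one contiguous chunk per worker
chunk_size = -(-len(demo_df) // n_workers)  # ceiling division
chunks = [
    demo_df.iloc[start:start + chunk_size]
    for start in range(0, len(demo_df), chunk_size)
]

# 2. Apply the function to every chunk concurrently (threads stand in for processes)
with ThreadPoolExecutor(max_workers=n_workers) as executor:
    partial_results = list(
        executor.map(lambda chunk: chunk.apply(row_function, axis=1), chunks)
    )

# 3. Concatenate the partial results back in the original row order
chunked_result = pd.concat(partial_results)
sequential_result = demo_df.apply(row_function, axis=1)

print(np.allclose(chunked_result, sequential_result))
```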

In [5]:
%load_ext autoreload
%autoreload 2
# ---- Preparing Dataset ----
from pandarallel import pandarallel
import random
from tqdm.notebook import tqdm_notebook

tqdm_notebook.pandas()
pandarallel.initialize(progress_bar=True)
random.seed(SEED)

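def transformation(df):
    # With axis=1, 'df' receives a single row (a Series) on each call.
    # numpy is imported inside the function so that it stays self-contained,
    # which pandarallel requires on Windows (see the notebook linked above).
    import numpy as np
    return np.sin(df.A**2) + np.sin(df.B**2) + np.tan(df.A**2)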

df = pd.DataFrame({
    'A': [random.randint(15,20) for index in range(1, 1_000_000)]
    , 'B': [random.randint(10,30) for index in range(1, 1_000_000)]
})

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

https://nalepae.github.io/pandarallel/troubleshooting/


In [6]:
%%time
# ---- Without Parallelization ----
df.progress_apply(transformation, axis=1)


CPU times: total: 35.6 s
Wall time: 35.7 s


0          1.705438
1          0.164429
2         23.226399
3         -0.216171
4         -0.502609
            ...    
999994    -0.773902
999995    23.621330
999996    -0.491467
999997     1.036482
999998    -0.823155
Length: 999999, dtype: float64

In [7]:
%%time
# ---- With Parallelization ----
df.parallel_apply(transformation, axis=1)


CPU times: total: 438 ms
Wall time: 14.1 s


0          1.705438
1          0.164429
2         23.226399
3         -0.216171
4         -0.502609
            ...    
999994    -0.773902
999995    23.621330
999996    -0.491467
999997     1.036482
999998    -0.823155
Length: 999999, dtype: float64
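
In the runs above, parallelization cut the wall time from 35.7 s to 14.1 s. As a side note, it may be worth checking whether a transformation can be vectorized before reaching for parallelism. The sketch below assumes the function only uses NumPy ufuncs on whole columns, as `transformation` does, and runs on a tiny hypothetical `small_df`. It compares the row-wise `apply` with a single call on the full DataFrame.

```python
import numpy as np
import pandas as pd

def transformation(df):
    return np.sin(df.A**2) + np.sin(df.B**2) + np.tan(df.A**2)

# Small hypothetical sample in the same value ranges as the notebook's data
small_df = pd.DataFrame({
    'A': [15, 16, 17, 18, 19, 20]
    , 'B': [10, 14, 18, 22, 26, 30]
})

row_wise = small_df.apply(transformation, axis=1)  # one Python call per row
vectorized = transformation(small_df)              # one call on whole columns

print(np.allclose(row_wise, vectorized))
```

When both approaches agree like this, the vectorized call usually avoids the per-row Python overhead altogether. Parallel `apply` remains useful for functions that cannot be expressed as column operations.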

<p id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</p>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/)