Got KeyError on date/string columns [featurewiz 0.1.99] #48

eromoe · 2022-08-29T03:40:10Z

Hello, after updating to featurewiz 0.1.99, I got a different error.

The code is:

from featurewiz import FeatureWiz
features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
X= features.fit_transform(X, y)
features.features  ### provides the list of selected features ###

traceback:

KeyError                                  Traceback (most recent call last)
Input In [71], in <cell line: 1>()
      8 from featurewiz import FeatureWiz
      9 features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
---> 10 X = features.fit_transform(X, y)
     11 cols = features.features  ### provides the list of selected features ###
     12 print(features.features)

File ~\anaconda3\lib\site-packages\sklearn\base.py:870, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    867     return self.fit(X, **fit_params).transform(X)
    868 else:
    869     # fit method of arity 2 (supervised transformation)
--> 870     return self.fit(X, y, **fit_params).transform(X)

File ~\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2934, in FeatureWiz.fit(self, X, y)
   2931     return {}, {}
   2932 #### Send target variable as it is so that y_train is analyzed properly ###
   2933 # Select features using featurewiz
-> 2934 features, X_sel = featurewiz(df, target, self.corr_limit, self.verbose, self.sep,
   2935         self.header, self.test_data, self.feature_engg, self.category_encoders,
   2936         self.dask_xgboost_flag, self.nrows)
   2937 # Convert the remaining column names back to integers and drop the
   2938 difftime = max(1, int(time.time()-start_time))

File ~\anaconda3\lib\site-packages\featurewiz\featurewiz.py:1101, in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
   1099     print('Since %s category encoding is done, dropping original categorical vars from predictors...' %feature_gen)
   1100     preds = left_subtract(preds, catvars)
-> 1101 train_p = train[preds]
   1102 if train_p.shape[1] <= 10:
   1103     iter_limit = 2

File ~\anaconda3\lib\site-packages\pandas\core\frame.py:3511, in DataFrame.__getitem__(self, key)
   3509     if is_iterator(key):
   3510         key = list(key)
-> 3511     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3513 # take() does not accept boolean indexers
   3514 if getattr(indexer, "dtype", None) == bool:

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5782, in Index._get_indexer_strict(self, key, axis_name)
   5779 else:
   5780     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5782 self._raise_if_missing(keyarr, indexer, axis_name)
   5784 keyarr = self.take(indexer)
   5785 if isinstance(key, Index):
   5786     # GH 42790 - Preserve name from an Index

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5845, in Index._raise_if_missing(self, key, indexer, axis_name)
   5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 5845 raise KeyError(f"{not_found} not in index")

KeyError: "['network_type__first', 'device_model__first', 'ad_account__first', 'os_version__first', 'carrier__first', 'reg_week_day', 'os__first', 'hour__first', 'ad_source__first', 'ad_serving_user_group__first', 'firstecpm__first', 'province__first', 'manufacturer__first'] not in index"

The columns named in the error are all date/string types.


AutoViML · 2022-08-30T01:00:37Z

Hi @eromoe :
I am not able to reproduce the error. Can you drop a small snippet from your data into a zip file here? I will then troubleshoot it. 👍
Thanks
AutoVimal

eromoe · 2022-08-30T07:08:11Z

After I trim X to the top 2000 rows, these features are no longer treated as important, and the whole dataset is too large to upload. So I debugged it myself, and this PR fixes it: #49

I also forgot to post the messages printed by featurewiz.
Those categorical features all end up as error_columns:

Readying dataset for Recursive XGBoost by converting all features to numeric...
    error converting province__first column from string to numeric. Continuing...
    error converting firstecpm__first column from string to numeric. Continuing...
    error converting network_type__first column from string to numeric. Continuing...
    error converting ad_serving_user_group__first column from string to numeric. Continuing...
    error converting reg_week_day column from string to numeric. Continuing...
    error converting ad_source__first column from string to numeric. Continuing...
    error converting os_version__first column from string to numeric. Continuing...
    error converting ad_account__first column from string to numeric. Continuing...
    error converting device_model__first column from string to numeric. Continuing...
    error converting manufacturer__first column from string to numeric. Continuing...
    error converting carrier__first column from string to numeric. Continuing...
    error converting os__first column from string to numeric. Continuing...
    error converting hour__first column from string to numeric. Continuing...
    removing 13 object columns that could not be converted to numeric
Shape of train data after pruning = (82921, 1010)
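
For readers following along, here is a minimal sketch of how this kind of KeyError can arise and how keeping the predictor list in sync avoids it. It is not the featurewiz source: the toy frame and the variable names `error_columns` and `preds` are assumptions borrowed from the log output and the title of #49.

```python
import pandas as pd

# Toy frame (assumption): one numeric column and one string column that cannot be cast.
train = pd.DataFrame({"c_0": [0, 1, 2], "os_version__first": ["10.1", "11", "n/a"]})
preds = list(train.columns)

# Mimic the "removing N object columns" step: collect columns that fail conversion.
error_columns = []
for col in preds:
    try:
        train[col] = pd.to_numeric(train[col])
    except (ValueError, TypeError):
        error_columns.append(col)
train = train.drop(columns=error_columns)

# Indexing with the stale predictor list raises a KeyError like the one in the traceback.
try:
    train[preds]
    raised = False
except KeyError:
    raised = True

# Removing the dropped columns from preds first makes the lookup succeed.
preds = [col for col in preds if col not in error_columns]
train_p = train[preds]
```

The point is only that the column list and the frame have to be pruned together; how featurewiz does this internally is what #49 addresses.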

eromoe · 2022-08-30T07:21:17Z

I tried to create some sample data:

import numpy as np
import pandas as pd

# X is the original feature frame; only one of its string columns is copied over.
a = np.random.randint(0, 3, (X.shape[0], 100))
b = pd.DataFrame(data=a, columns=[f'c_{i}' for i in range(100)])
b['os_version__first'] = X['os_version__first'].values
b.to_csv('test.csv', header=True, index=False)

from featurewiz import FeatureWiz
features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
c = features.fit_transform(b, y)
cols = features.features  ### provides the list of selected features ###
print(features.features)

But I got another error:

wiz = FeatureWiz(verbose=1)
        X_train_selected = wiz.fit_transform(X_train, y_train)
        X_test_selected = wiz.transform(X_test)
        wiz.features  ### provides a list of selected features ###            
        
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
f:\md\jupyter_pipeline\pj01\1.1.0 clean_data.ipynb Cell 138 in <cell line: 7>()
      5 from featurewiz import FeatureWiz
      6 features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
----> 7 c = features.fit_transform(b, y)
      8 cols = features.features  ### provides the list of selected features ###
      9 print(features.features)

File c:\Users\Kasim\anaconda3\lib\site-packages\sklearn\base.py:702, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    699     return self.fit(X, **fit_params).transform(X)
    700 else:
    701     # fit method of arity 2 (supervised transformation)
--> 702     return self.fit(X, y, **fit_params).transform(X)

File c:\Users\Kasim\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2921, in FeatureWiz.fit(self, X, y)
   2919 else:
   2920     df = pd.concat([X.reset_index(drop=True), y], axis=1)
-> 2921     df.index = X_index
   2922 ### Now you can process the X and y datasets ####
   2923 if isinstance(y, pd.Series):

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\generic.py:5588, in NDFrame.__setattr__(self, name, value)
   5586 try:
   5587     object.__getattribute__(self, name)
-> 5588     return object.__setattr__(self, name, value)
   5589 except AttributeError:
   5590     pass

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\_libs\properties.pyx:70, in pandas._libs.properties.AxisProperty.__set__()

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\generic.py:769, in NDFrame._set_axis(self, axis, labels)
    767 def _set_axis(self, axis: int, labels: Index) -> None:
    768     labels = ensure_index(labels)
--> 769     self._mgr.set_axis(axis, labels)
    770     self._clear_item_cache()

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\internals\managers.py:214, in BaseBlockManager.set_axis(self, axis, new_labels)
    212 def set_axis(self, axis: int, new_labels: Index) -> None:
    213     # Caller is responsible for ensuring we have an Index object.
--> 214     self._validate_set_axis(axis, new_labels)
    215     self.axes[axis] = new_labels

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\internals\base.py:69, in DataManager._validate_set_axis(self, axis, new_labels)
     66     pass
     68 elif new_len != old_len:
---> 69     raise ValueError(
     70         f"Length mismatch: Expected axis has {old_len} elements, new "
     71         f"values have {new_len} elements"
     72     )

ValueError: Length mismatch: Expected axis has 165842 elements, new values have 82921 elements
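
Note that 165842 is exactly twice 82921. One plausible explanation (an assumption, since `y` is not shown in this thread) is that `y` carries a non-default index while `X` has its index reset, so `pd.concat(..., axis=1)` outer-joins the two indexes and doubles the rows. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

n = 4
X = pd.DataFrame({"c_0": np.arange(n)})  # default index 0..3
# Assumed shape of the problem: y keeps an index that does not overlap with X's.
y = pd.Series(np.ones(n), index=np.arange(100, 100 + n), name="target")

# The index labels do not overlap, so the outer join produces 2 * n rows.
misaligned = pd.concat([X.reset_index(drop=True), y], axis=1)

# Resetting y's index too keeps the rows aligned one to one.
aligned = pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)

print(misaligned.shape, aligned.shape)
```

If this is the cause, calling `y.reset_index(drop=True)` before `fit_transform` would be a workaround on the caller's side.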

sample data:

test.csv

AutoViML · 2022-08-30T12:16:04Z

Hi @eromoe 👍
Thanks for the pull request. I don't see an error in my notebook when I run your test.csv file.
Featurewiz_Simple-Copy1.zip
Can you please find out why you are getting this error?
Thanks

eromoe · 2022-09-07T05:04:43Z

@AutoViML Today I debugged this and found the cause of the "error converting" problem.

It is because your label encoder returns a different type of result in different cases.

Because a tuple is returned, an error is raised here, and that error is not printed out, which I don't really agree with.
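
To illustrate the pattern being described, here is a hedged sketch. `encode_labels` and `to_numeric_column` are hypothetical names, not featurewiz functions; the code only shows an encoder whose return type varies, a caller that normalizes it, and a handler that logs the exception instead of hiding it.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def encode_labels(series, return_mapping=False):
    """Hypothetical encoder: returns a Series, or a (Series, dict) tuple."""
    codes, uniques = pd.factorize(series)
    encoded = pd.Series(codes, index=series.index, name=series.name)
    if return_mapping:
        return encoded, {label: code for code, label in enumerate(uniques)}
    return encoded


def to_numeric_column(series, **kwargs):
    """Normalize the encoder result and log any failure with its traceback."""
    try:
        result = encode_labels(series, **kwargs)
        if isinstance(result, tuple):
            # Unpack the tuple case so callers always get a Series.
            result = result[0]
        return result.astype("int64")
    except Exception:
        # Report the real exception rather than only "error converting ...".
        logger.exception("error converting %s column to numeric", series.name)
        return None


col = pd.Series(["a", "b", "a"], name="os__first")
plain = to_numeric_column(col)
with_mapping = to_numeric_column(col, return_mapping=True)
```

Returning one consistent type from the encoder would remove the need for the `isinstance` check altogether.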

eromoe mentioned this issue Aug 30, 2022

remove error_columns from preds #49

Merged

AutoViML closed this as completed Sep 1, 2022
