Got KeyError on date/string columns [featurewiz 0.1.99] #48

Closed
eromoe opened this issue Aug 29, 2022 · 5 comments

Comments

@eromoe
Contributor

eromoe commented Aug 29, 2022

Hello, after updating to featurewiz 0.1.99 I get a different error.

The code is:

from featurewiz import FeatureWiz
features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
X = features.fit_transform(X, y)
features.features  ### provides the list of selected features ###

traceback:

KeyError                                  Traceback (most recent call last)
Input In [71], in <cell line: 1>()
      8 from featurewiz import FeatureWiz
      9 features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
---> 10 X = features.fit_transform(X, y)
     11 cols = features.features  ### provides the list of selected features ###
     12 print(features.features)

File ~\anaconda3\lib\site-packages\sklearn\base.py:870, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    867     return self.fit(X, **fit_params).transform(X)
    868 else:
    869     # fit method of arity 2 (supervised transformation)
--> 870     return self.fit(X, y, **fit_params).transform(X)

File ~\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2934, in FeatureWiz.fit(self, X, y)
   2931     return {}, {}
   2932 #### Send target variable as it is so that y_train is analyzed properly ###
   2933 # Select features using featurewiz
-> 2934 features, X_sel = featurewiz(df, target, self.corr_limit, self.verbose, self.sep,
   2935         self.header, self.test_data, self.feature_engg, self.category_encoders,
   2936         self.dask_xgboost_flag, self.nrows)
   2937 # Convert the remaining column names back to integers and drop the
   2938 difftime = max(1, int(time.time()-start_time))

File ~\anaconda3\lib\site-packages\featurewiz\featurewiz.py:1101, in featurewiz(dataname, target, corr_limit, verbose, sep, header, test_data, feature_engg, category_encoders, dask_xgboost_flag, nrows, **kwargs)
   1099     print('Since %s category encoding is done, dropping original categorical vars from predictors...' %feature_gen)
   1100     preds = left_subtract(preds, catvars)
-> 1101 train_p = train[preds]
   1102 if train_p.shape[1] <= 10:
   1103     iter_limit = 2

File ~\anaconda3\lib\site-packages\pandas\core\frame.py:3511, in DataFrame.__getitem__(self, key)
   3509     if is_iterator(key):
   3510         key = list(key)
-> 3511     indexer = self.columns._get_indexer_strict(key, "columns")[1]
   3513 # take() does not accept boolean indexers
   3514 if getattr(indexer, "dtype", None) == bool:

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5782, in Index._get_indexer_strict(self, key, axis_name)
   5779 else:
   5780     keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
-> 5782 self._raise_if_missing(keyarr, indexer, axis_name)
   5784 keyarr = self.take(indexer)
   5785 if isinstance(key, Index):
   5786     # GH 42790 - Preserve name from an Index

File ~\anaconda3\lib\site-packages\pandas\core\indexes\base.py:5845, in Index._raise_if_missing(self, key, indexer, axis_name)
   5842     raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   5844 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
-> 5845 raise KeyError(f"{not_found} not in index")

KeyError: "['network_type__first', 'device_model__first', 'ad_account__first', 'os_version__first', 'carrier__first', 'reg_week_day', 'os__first', 'hour__first', 'ad_source__first', 'ad_serving_user_group__first', 'firstecpm__first', 'province__first', 'manufacturer__first'] not in index"

The columns named in the error are all of date or string type.
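
A quick way to confirm which columns fall into that category before calling `fit_transform` is to list every non-numeric column. This is only a sketch using plain pandas; `non_numeric_columns` and `X_demo` are hypothetical names for illustration and are not part of featurewiz.

```python
import pandas as pd


def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric (object, datetime, ...)."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Tiny stand-in frame; the real X from this issue is not available.
X_demo = pd.DataFrame({
    "clicks": [1, 2, 3],
    "os__first": ["ios", "android", "ios"],
    "reg_week_day": pd.to_datetime(["2022-08-01", "2022-08-02", "2022-08-03"]),
})

print(non_numeric_columns(X_demo))
```

Comparing that list against the names in the `KeyError` message shows whether the failing columns are exactly the string/date ones.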

@AutoViML
Owner

Hi @eromoe :
I am not able to reproduce the error. Can you drop a small snippet from your data into a zip file here? I will then troubleshoot it. 👍
Thanks
AutoVimal

@eromoe
Contributor Author

eromoe commented Aug 30, 2022

After I trim X to the first 2000 rows, these features are no longer treated as important, and the whole dataset is too large to share. So I debugged it myself, and this PR should fix it: #49


I also forgot to post the messages printed by featurewiz.
The categorical features are all reported as error columns:

Readying dataset for Recursive XGBoost by converting all features to numeric...
    error converting province__first column from string to numeric. Continuing...
    error converting firstecpm__first column from string to numeric. Continuing...
    error converting network_type__first column from string to numeric. Continuing...
    error converting ad_serving_user_group__first column from string to numeric. Continuing...
    error converting reg_week_day column from string to numeric. Continuing...
    error converting ad_source__first column from string to numeric. Continuing...
    error converting os_version__first column from string to numeric. Continuing...
    error converting ad_account__first column from string to numeric. Continuing...
    error converting device_model__first column from string to numeric. Continuing...
    error converting manufacturer__first column from string to numeric. Continuing...
    error converting carrier__first column from string to numeric. Continuing...
    error converting os__first column from string to numeric. Continuing...
    error converting hour__first column from string to numeric. Continuing...
    removing 13 object columns that could not be converted to numeric
Shape of train data after pruning = (82921, 1010)
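
The log above and the traceback fit together if the 13 removed columns are still listed among the predictors when `train[preds]` runs. That is an assumption based on the output, not a confirmed reading of the featurewiz source. A minimal sketch of the failure and one possible guard, with all names hypothetical:

```python
import pandas as pd

train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "city": ["x", "y"]})
preds = ["a", "b", "city"]

# Simulate the pruning step: drop the column that could not be converted,
# but leave the predictor list untouched.
train = train.drop(columns=["city"])

try:
    train[preds]  # selecting a list with a missing label raises KeyError
    raised = False
except KeyError:
    raised = True

# Hypothetical guard: keep only the predictors that still exist in the frame.
preds_present = [col for col in preds if col in train.columns]
train_p = train[preds_present]
```

Filtering the predictor list right after columns are dropped keeps the two in sync, whatever the reason for the drop.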

@eromoe
Contributor Author

eromoe commented Aug 30, 2022

I tried to create some sample data:

import numpy as np
import pandas as pd

# X is the original feature DataFrame from the first comment
a = np.random.randint(0, 3, (X.shape[0], 100))
b = pd.DataFrame(data=a, columns=[f'c_{i}' for i in range(100)])
b['os_version__first'] = X['os_version__first'].values
b.to_csv('test.csv', header=True, index=False)

from featurewiz import FeatureWiz
features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
c = features.fit_transform(b, y)
cols = features.features  ### provides the list of selected features ###
print(features.features)

But I got a different error:

wiz = FeatureWiz(verbose=1)
        X_train_selected = wiz.fit_transform(X_train, y_train)
        X_test_selected = wiz.transform(X_test)
        wiz.features  ### provides a list of selected features ###            
        
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
f:\md\jupyter_pipeline\pj01\1.1.0 clean_data.ipynb Cell 138 in <cell line: 7>()
      5 from featurewiz import FeatureWiz
      6 features = FeatureWiz(corr_limit=0.70, feature_engg='', category_encoders='', dask_xgboost_flag=False, nrows=None, verbose=2)
----> 7 c = features.fit_transform(b, y)
      8 cols = features.features  ### provides the list of selected features ###
      9 print(features.features)

File c:\Users\Kasim\anaconda3\lib\site-packages\sklearn\base.py:702, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    699     return self.fit(X, **fit_params).transform(X)
    700 else:
    701     # fit method of arity 2 (supervised transformation)
--> 702     return self.fit(X, y, **fit_params).transform(X)

File c:\Users\Kasim\anaconda3\lib\site-packages\featurewiz\featurewiz.py:2921, in FeatureWiz.fit(self, X, y)
   2919 else:
   2920     df = pd.concat([X.reset_index(drop=True), y], axis=1)
-> 2921     df.index = X_index
   2922 ### Now you can process the X and y datasets ####
   2923 if isinstance(y, pd.Series):

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\generic.py:5588, in NDFrame.__setattr__(self, name, value)
   5586 try:
   5587     object.__getattribute__(self, name)
-> 5588     return object.__setattr__(self, name, value)
   5589 except AttributeError:
   5590     pass

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\_libs\properties.pyx:70, in pandas._libs.properties.AxisProperty.__set__()

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\generic.py:769, in NDFrame._set_axis(self, axis, labels)
    767 def _set_axis(self, axis: int, labels: Index) -> None:
    768     labels = ensure_index(labels)
--> 769     self._mgr.set_axis(axis, labels)
    770     self._clear_item_cache()

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\internals\managers.py:214, in BaseBlockManager.set_axis(self, axis, new_labels)
    212 def set_axis(self, axis: int, new_labels: Index) -> None:
    213     # Caller is responsible for ensuring we have an Index object.
--> 214     self._validate_set_axis(axis, new_labels)
    215     self.axes[axis] = new_labels

File c:\Users\Kasim\anaconda3\lib\site-packages\pandas\core\internals\base.py:69, in DataManager._validate_set_axis(self, axis, new_labels)
     66     pass
     68 elif new_len != old_len:
---> 69     raise ValueError(
     70         f"Length mismatch: Expected axis has {old_len} elements, new "
     71         f"values have {new_len} elements"
     72     )

ValueError: Length mismatch: Expected axis has 165842 elements, new values have 82921 elements
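
Note that 165842 is exactly 2 × 82921. One plausible explanation, assuming `y` still carries a non-default index while `b` was built with a fresh `RangeIndex`, is that `pd.concat(..., axis=1)` aligns on index labels and takes their union, which doubles the row count when the labels do not overlap. A small sketch with hypothetical data:

```python
import pandas as pd

X_small = pd.DataFrame({"c_0": [0, 1, 2]})                            # index 0, 1, 2
y_small = pd.Series([1, 0, 1], index=[10, 11, 12], name="target")     # disjoint labels

# concat along axis=1 performs an outer join on the index by default.
misaligned = pd.concat([X_small.reset_index(drop=True), y_small], axis=1)

# Resetting both indexes makes the rows line up positionally.
aligned = pd.concat(
    [X_small.reset_index(drop=True), y_small.reset_index(drop=True)], axis=1
)

print(len(misaligned), len(aligned))
```

If this is what happens here, passing `y.reset_index(drop=True)` (or a `y` that shares `b`'s index) to `fit_transform` would avoid the mismatch.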

sample data:

test.csv

@AutoViML
Owner

AutoViML commented Aug 30, 2022

Hi @eromoe 👍
Thanks for the pull request. I don't see an error in my notebook when I run your test.csv file.
Featurewiz_Simple-Copy1.zip
Can you please find out why you are getting this error?
Thanks

@AutoViML AutoViML closed this as completed Sep 1, 2022
@eromoe
Contributor Author

eromoe commented Sep 7, 2022

@AutoViML Today I debugged this and found the cause of the "error converting" problem.

It is because your label encoder returns a different type of result in different cases.
image

Because a tuple is returned, an error is raised here, and that error is never printed, which I don't really agree with.

image
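
Since the screenshots are not reproduced here, the following is only an illustration of the pattern being described: a helper that sometimes returns a Series and sometimes a tuple, and a caller that normalises the result. `encode_labels` and `to_numeric_column` are hypothetical names and do not correspond to featurewiz's actual functions.

```python
import pandas as pd


def encode_labels(series: pd.Series, return_mapping: bool = False):
    """Hypothetical encoder whose return type depends on a flag."""
    codes, uniques = pd.factorize(series)
    encoded = pd.Series(codes, index=series.index, name=series.name)
    if return_mapping:
        # Tuple return: (encoded values, code -> label mapping)
        return encoded, dict(enumerate(uniques))
    return encoded


def to_numeric_column(series: pd.Series, **kwargs) -> pd.Series:
    """Call the encoder and always hand back a Series, unwrapping a tuple if needed."""
    result = encode_labels(series, **kwargs)
    if isinstance(result, tuple):
        result = result[0]
    return result


s = pd.Series(["ios", "android", "ios"], name="os__first")
print(to_numeric_column(s, return_mapping=True).tolist())
```

Whatever the real function looks like, either returning one consistent type or unwrapping at the call site avoids the hidden failure, and printing the caught exception would have made the cause visible.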
