In [1]:
import pandas as pd

## Matching the feature names in two datasets

- **Use case 1:** Comparison of feature names in the training and testing sets

- **Use case 2:** Comparison of feature names of two data frames that need to be combined for EDA

### When two data frames have many features, matching their names by inspecting the columns attribute of each data frame by hand is cumbersome
- **Constraints:**
    - Only **two data frames** are considered
    - Check the shape of the two data frames and compare the number of features, e.g. for data frames df1 and df2:

        - If the numbers of features are equal **[len(df1.columns) == len(df2.columns)]**
            - Then the order in which the data frames (e.g. df1, df2) are passed to the user-defined function below does not change whether the names are reported as matching

        - If the numbers of features differ **[len(df1.columns) < len(df2.columns)]**
            - Then pass the data frame with fewer features as the first argument and the one with more features as the second argument
              (the df1 parameter should be the data frame that has fewer features than the df2 parameter)
          

- **Note:** A different order of feature names in the two data frames is handled by concat()
    - When two data frames are combined with **concat()**, columns are aligned by name, so a mismatch in column order is ignored (**append()** behaved the same way, but it was deprecated in pandas 1.4 and removed in pandas 2.0)
    - This alignment only helps if the **FEATURE NAMES ARE THE SAME in both datasets**; a name that appears in only one data frame becomes a separate column, filled with NaN for the rows of the other data frame
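
As a rough illustration of the note above, the sketch below concatenates two small frames whose columns are in a different order, and then two frames whose column names differ. The frame names (`left`, `right`, `renamed`) and the values are made up for this example, and only `pd.concat` is used, since `DataFrame.append` is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Same feature names, different column order (illustrative data)
left = pd.DataFrame([[0, 123, 59]], columns=["click", "id", "location"])
right = pd.DataFrame([[1, 421, 123]], columns=["click", "location", "id"])

# concat aligns columns by name, so the order mismatch does not matter
combined = pd.concat([left, right], ignore_index=True)
print(combined)

# Different feature names: 'loc' does not match 'location'
renamed = pd.DataFrame([[1, 421]], columns=["click", "loc"])

# With the default outer join, unmatched names become separate columns with NaN
mismatched = pd.concat([left, renamed], ignore_index=True)
print(mismatched)
```

In the second case nothing fails, which is why checking the feature names before combining is worthwhile: the mismatch only shows up as extra columns with missing values.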


In [2]:
def feature_check(df1, df2):
    """Return True if the feature names match, else the names unique to each data frame.

    Pass the data frame with fewer features as df1 (see the constraints above).
    """
    cols1, cols2 = set(df1.columns), set(df2.columns)

    # Only the names in df2 that are missing from df1 are tested here,
    # which is why df1 must not have more features than df2
    if not (cols2 - cols1):
        return True

    # Otherwise return the non-matching features of df1 as well as df2
    return {'df1': cols1 - cols2, 'df2': cols2 - cols1}
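
The constraint on argument order could be avoided by comparing the column sets in both directions. The following is a minimal sketch under that assumption, not a replacement for the function above; the name `feature_diff` and the sample frames `wide` and `narrow` are hypothetical and used only for illustration.

```python
import pandas as pd

def feature_diff(df1, df2):
    """Hypothetical order-independent variant of feature_check."""
    cols1, cols2 = set(df1.columns), set(df2.columns)
    only_in_df1 = cols1 - cols2   # names present in df1 but missing from df2
    only_in_df2 = cols2 - cols1   # names present in df2 but missing from df1

    # The names match only if neither side has names the other lacks
    if not only_in_df1 and not only_in_df2:
        return True
    return {"df1": only_in_df1, "df2": only_in_df2}

# Illustrative data: 'narrow' has a subset of the features of 'wide'
wide = pd.DataFrame([[0, 123, 56, "M"]], columns=["click", "id", "location", "gender"])
narrow = pd.DataFrame([[1, 123]], columns=["click", "id"])

# The mismatch is reported whichever frame is passed first
result_wide_first = feature_diff(wide, narrow)
result_narrow_first = feature_diff(narrow, wide)
print(result_wide_first)
print(result_narrow_first)
```

The keys `'df1'` and `'df2'` still refer to the first and second argument, so swapping the arguments swaps the two sets in the result but never hides a mismatch.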
    

In [3]:
# proof of concept using dummy data
noclickDF = pd.DataFrame([[0,123,59],[0,1543,56]], columns=['click', 'id','location'])
clickDF = pd.DataFrame([[1,123,421],[1,1543,436]], columns=['click', 'location','id'])

In [4]:
feature_check(clickDF,noclickDF)

True

In [5]:
noclickDF = pd.DataFrame([[0,123,56,'M'],[0,1543,567,'F']], columns=['click', 'id','location','gender'])
clickDF = pd.DataFrame([[1,123],[1,1543]], columns=['click', 'loc'])

In [6]:
feature_check(clickDF,noclickDF)

{'df1': {'loc'}, 'df2': {'gender', 'id', 'location'}}