## PART C: FEATURE ENGINEERING

These notes cover feature engineering, focusing on how to deal with categorical variables.

There are two types of categorical variables:

- Nominal: no order or hierarchy, e.g. gender or genre.

- Ordinal: the categories have a natural order or hierarchy, e.g. education level.
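
The distinction can be made explicit in pandas. The following is a minimal sketch using made-up values (the `genre` and `education` data are illustrative assumptions, not taken from any dataset in these notes):

```python
import pandas as pd

# Nominal: categories have no inherent order.
genre = pd.Categorical(["Action", "Puzzle", "Action"], ordered=False)

# Ordinal: categories carry an explicit order, listed lowest to highest.
education = pd.Categorical(
    ["Bachelor", "High school", "Master"],
    categories=["High school", "Bachelor", "Master"],
    ordered=True,
)

print(genre.ordered)      # False
print(education.ordered)  # True
# Integer codes follow the declared category order, not alphabetical order.
print(list(education.codes))  # [1, 0, 2]
```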

### Dealing with Nominal categorical Data

Consider a movie dataset with a Genre column. To map the genre names to numeric values, fit a `LabelEncoder` and read the mapping from its `classes_` attribute:


```
from sklearn.preprocessing import LabelEncoder

gle = LabelEncoder()

genre_labels = gle.fit_transform(movies['Genre'])

genre_mappings = {index: label for index, label in 
                  enumerate(gle.classes_)}
genre_mappings
```
Output:

{0: 'Action', 1: 'Adventure', 2: 'Fighting', 3: 'Misc',
 4: 'Platform', 5: 'Puzzle'}
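
As a self-contained illustration, the sketch below runs the same steps on a tiny made-up `movies` frame (the rows are hypothetical) and shows that `inverse_transform` recovers the original labels from the codes:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the Genre column.
movies = pd.DataFrame({"Genre": ["Puzzle", "Action", "Misc", "Action"]})

gle = LabelEncoder()
genre_labels = gle.fit_transform(movies["Genre"])

# classes_ is sorted, so the codes follow the alphabetical order of the labels.
print(list(gle.classes_))   # ['Action', 'Misc', 'Puzzle']
print(list(genre_labels))   # [2, 0, 1, 0]

# inverse_transform maps the integer codes back to the original strings.
recovered = gle.inverse_transform(genre_labels)
print(list(recovered))      # ['Puzzle', 'Action', 'Misc', 'Action']
```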



### Dealing with Ordinal Categorical Data

Consider a dataset of vintage cars spanning the first generation to the latest.
Because the generations have a natural order, you assign numeric values by defining an explicit mapping:

```
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3, 
               'Gen 4': 4, }
vintage_cars['GenerationLabel'] = vintage_cars['Generation'].map(gen_ord_map)

```
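
One thing to watch for with `map` is that any value missing from the dictionary becomes `NaN`. The sketch below uses a made-up `vintage_cars` frame with a hypothetical `'Gen 5'` row to show how such values can be spotted:

```python
import pandas as pd

# Hypothetical toy data; 'Gen 5' is deliberately absent from the mapping.
vintage_cars = pd.DataFrame({"Generation": ["Gen 2", "Gen 1", "Gen 5", "Gen 4"]})

gen_ord_map = {"Gen 1": 1, "Gen 2": 2, "Gen 3": 3, "Gen 4": 4}
vintage_cars["GenerationLabel"] = vintage_cars["Generation"].map(gen_ord_map)

# Values not found in the mapping come back as NaN; list them for review.
unmapped = vintage_cars.loc[
    vintage_cars["GenerationLabel"].isna(), "Generation"
].unique()
print(list(unmapped))  # ['Gen 5']
```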
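An alternative approach uses `LabelEncoder`. Note that it assigns codes in sorted (alphabetical) order of the labels, which matches the intended order here but is not guaranteed to for other label sets: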

```
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(vintage_cars['Generation'])
vintage_cars['Gen_Label'] = gen_labels
```
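
Because `LabelEncoder` sorts labels as strings, it can misorder ordinal data. The sketch below is one possible alternative, not part of the original notes: scikit-learn's `OrdinalEncoder` with an explicit `categories` list. The `cars` frame and the `'Gen 10'` label are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data where string sorting disagrees with the real order.
cars = pd.DataFrame({"Generation": ["Gen 2", "Gen 10", "Gen 1"]})

# LabelEncoder sorts as strings: 'Gen 1' < 'Gen 10' < 'Gen 2'.
le_codes = LabelEncoder().fit_transform(cars["Generation"])
print(list(le_codes))  # [2, 1, 0]  ('Gen 10' ranks below 'Gen 2')

# OrdinalEncoder accepts the intended order explicitly.
order = ["Gen 1", "Gen 2", "Gen 10"]
oe = OrdinalEncoder(categories=[order])
oe_codes = oe.fit_transform(cars[["Generation"]]).ravel()
print(list(oe_codes))  # [1.0, 2.0, 0.0]
```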

### One Hot Encoding to numerical labels
```
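import pandas as pd

# encode generation labels using one-hot encoding scheme
gen_ohe = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
gen_feature_arr = gen_ohe.fit_transform(
    vintage_cars[['Gen_Label']]).toarray()
# reuse the label names from the LabelEncoder as column names
gen_feature_labels = list(gen_le.classes_)
gen_features = pd.DataFrame(gen_feature_arr,
                            columns=gen_feature_labels)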
```
### Using get_dummies()

```
gen_onehot_features = pd.get_dummies(vintage_cars['Generation'])
```
### Dummy Coding scheme VS One-hot encoding scheme
The dummy coding scheme is similar to the one-hot encoding scheme, except that when it is applied to a categorical feature with m distinct labels, we get m - 1 binary features. Each value of the categorical variable is thus converted into a vector of size m - 1. The extra feature is dropped entirely: if the category values range over {0, 1, …, m - 1}, either the 0th or the (m - 1)th feature column is dropped, and the corresponding category is represented by a vector of all zeros.
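
The difference can be seen with `pd.get_dummies` and its `drop_first` parameter. This is a minimal sketch on a made-up `cars` frame (the data is hypothetical); here m = 3, and the first category is the one dropped:

```python
import pandas as pd

# Hypothetical toy data with m = 3 distinct labels.
cars = pd.DataFrame({"Generation": ["Gen 1", "Gen 2", "Gen 3", "Gen 1"]})

# One-hot encoding: one column per label (m columns).
one_hot = pd.get_dummies(cars["Generation"])

# Dummy coding: the first label's column is dropped (m - 1 columns).
dummy = pd.get_dummies(cars["Generation"], drop_first=True)

print(one_hot.shape[1])     # 3
print(list(dummy.columns))  # ['Gen 2', 'Gen 3']
# Rows belonging to the dropped label 'Gen 1' are all zeros (False).
print(int(dummy.iloc[0].sum()))  # 0
```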

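### Curse of Dimensionality
One-hot encoding a feature with many distinct categories produces one column per category, so the number of features can grow very large. This can be mitigated with methods such as feature hashing and bin counting.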
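
As a rough illustration of feature hashing, the sketch below uses scikit-learn's `FeatureHasher` to squeeze six labels into four columns. The label list and the choice of `n_features=4` are assumptions made for the example; with so few columns, distinct labels may collide in the same column.

```python
from sklearn.feature_extraction import FeatureHasher

# Illustrative labels; in practice this would be a high-cardinality column.
genres = ["Action", "Adventure", "Fighting", "Misc", "Platform", "Puzzle"]

# Output width is fixed up front, regardless of how many labels exist.
hasher = FeatureHasher(n_features=4, input_type="string")

# Each sample must be an iterable of strings, so wrap each label in a list.
hashed = hasher.transform([[g] for g in genres]).toarray()

print(hashed.shape)  # (6, 4)
```

Each row contains a single non-zero entry (+1 or -1) in the column its label hashes to.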

In [11]:
# Load the clean data
import pandas as pd
df = pd.read_csv("../Data/Clean_Tanzania_Tourism_datasets.csv")
df.head()

| | ID | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | ... | package_transport_tz | package_sightseeing | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tour_0 | SWIZERLAND | 45-64 | Friends/Relatives | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Friends, relatives | Independent | ... | No | No | No | No | 13.0 | 0.0 | Cash | No | Friendly People | 674602.5 |
| 1 | tour_10 | UNITED KINGDOM | 25-44 | Alone | 1.0 | 0.0 | Leisure and Holidays | Cultural tourism | others | Independent | ... | No | No | No | No | 14.0 | 7.0 | Cash | Yes | Wonderful Country, Landscape, Nature | 3214906.5 |
| 2 | tour_1000 | UNITED KINGDOM | 25-44 | Alone | 0.0 | 1.0 | Visiting Friends and Relatives | Cultural tourism | Friends, relatives | Independent | ... | No | No | No | No | 1.0 | 31.0 | Cash | No | Excellent Experience | 3315000.0 |
| 3 | tour_1002 | UNITED KINGDOM | 25-44 | Spouse | 1.0 | 1.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Package Tour | ... | Yes | Yes | Yes | No | 11.0 | 0.0 | Cash | Yes | Friendly People | 7790250.0 |
| 4 | tour_1004 | CHINA | 1-24 | Alone | 1.0 | 0.0 | Leisure and Holidays | Wildlife tourism | Travel, agent, tour operator | Independent | ... | No | No | No | No | 7.0 | 4.0 | Cash | Yes | No comments | 1657500.0 |


### DATA TYPES CONVERSION

Counts such as the number of males, the number of females, and the number of nights stayed are whole numbers, so the columns that were loaded as floats should be converted to int. This keeps the data consistent with what it represents in reality.

In [14]:
# Convert float columns to int: total_female, total_male, night_mainland, night_zanzibar
df["total_female"] = df['total_female'].astype('int')
df["total_male"] = df['total_male'].astype('int')
df["night_mainland"] = df['night_mainland'].astype('int')
df["night_zanzibar"] = df['night_zanzibar'].astype('int')
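
Note that `astype('int')` raises an error on missing values and silently truncates fractional values. A defensive variant might check for both first. The sketch below runs on a made-up frame that reuses two of the column names above; the check itself is an assumption about what is desirable, not a step from the original notebook.

```python
import pandas as pd

# Hypothetical toy data reusing two of the column names above.
df = pd.DataFrame({"total_female": [1.0, 0.0, 2.0],
                   "total_male": [1.0, 1.0, 0.0]})
count_cols = ["total_female", "total_male"]

for col in count_cols:
    has_missing = df[col].isna().any()
    # A non-zero remainder means the value is not a whole number.
    has_fraction = (df[col].dropna() % 1 != 0).any()
    if has_missing or has_fraction:
        raise ValueError(f"{col} contains missing or non-integer values")

# Safe to convert once the checks pass.
df = df.astype({col: "int64" for col in count_cols})
print(df.dtypes.tolist())
```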

### FEATURE GENERATION

In [15]:
# Generate new features by combining related columns
df["total_people"] = df["total_female"] + df["total_male"]

df["total_nights"] = df["night_mainland"] + df["night_zanzibar"]

In [17]:
# More information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 25 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4809 non-null   object 
 1   country                4809 non-null   object 
 2   age_group              4809 non-null   object 
 3   travel_with            4809 non-null   object 
 4   total_female           4809 non-null   int64  
 5   total_male             4809 non-null   int64  
 6   purpose                4809 non-null   object 
 7   main_activity          4809 non-null   object 
 8   info_source            4809 non-null   object 
 9   tour_arrangement       4809 non-null   object 
 10  package_transport_int  4809 non-null   object 
 11  package_accomodation   4809 non-null   object 
 12  package_food           4809 non-null   object 
 13  package_transport_tz   4809 non-null   object 
 14  package_sightseeing    4809 non-null   object 
 15  pack

### ENCODING OBJECT FEATURES

In [18]:
# First, remove the ID column
df.drop('ID', axis='columns', inplace=True)

In [20]:
# Encode object (string) columns as integer codes

for colname in df.select_dtypes("object"):
    # factorize returns (codes, uniques); keep only the codes
    df[colname], _ = df[colname].factorize()
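
The loop above discards the second value returned by `factorize`, so the original labels cannot be recovered later. If they are needed, one option is to store them per column, as in the sketch below. The `mappings` dictionary and the toy frame (including the `'Credit Card'` value) are hypothetical. Also note that `factorize` numbers categories in order of first appearance, so the codes carry no meaningful order.

```python
import pandas as pd

# Hypothetical toy data; dtype=object mirrors how string columns are stored above.
df = pd.DataFrame({
    "payment_mode": pd.Series(["Cash", "Credit Card", "Cash"], dtype="object"),
    "total_cost": [674602.5, 3214906.5, 3315000.0],
})

mappings = {}  # hypothetical helper: column name -> {code: original label}
for colname in df.select_dtypes("object").columns:
    codes, uniques = df[colname].factorize()
    df[colname] = codes
    mappings[colname] = dict(enumerate(uniques))

print(df["payment_mode"].tolist())  # [0, 1, 0]
print(mappings)  # {'payment_mode': {0: 'Cash', 1: 'Credit Card'}}
```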

In [21]:
df.head()

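| | country | age_group | travel_with | total_female | total_male | purpose | main_activity | info_source | tour_arrangement | package_transport_int | ... | package_guided_tour | package_insurance | night_mainland | night_zanzibar | payment_mode | first_trip_tz | most_impressing | total_cost | total_people | total_nights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 674602.5 | 2 | 13 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 14 | 7 | 0 | 1 | 1 | 3214906.5 | 1 | 21 |
| 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 31 | 0 | 0 | 2 | 3315000.0 | 1 | 32 |
| 3 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 2 | 1 | 0 | ... | 1 | 0 | 11 | 0 | 0 | 1 | 0 | 7790250.0 | 2 | 11 |
| 4 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 7 | 4 | 0 | 1 | 3 | 1657500.0 | 1 | 11 |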


### SAVE DATA

In [23]:
df.to_csv("../Data/Final_data.csv",index=False)