# **Data Preparation**

1. [Handling Missing Values](#handling-missing-values)
2. [Feature Encoding Categorical Variables](#feature-encoding-categorical-variables)
3. [Handling Skewness](#handling-skewness)
4. [Correlation between Dependent and Independent Variables](#correlation-between-dependent-and-independent-variables)
5. [Removing Irrelevant Variables](#removing-irrelevant-variables)

### Importing Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Loading Dataset

In [None]:
df = pd.read_csv("hotel_booking.csv")

df.head(2)

## **Handling Missing Values**

### Getting Missing Values

In [None]:
cols_nan = [col for col in df.columns if df[col].isnull().sum() != 0]
percs = []
for col in cols_nan:
    # Percentage of rows in which this column is missing
    perc = df[col].isnull().sum() / df.shape[0] * 100
    percs.append(perc)
    print("{} : {}".format(col, perc))

From the above analysis:
- "company" has more than ***94%*** missing values. With so few observed values, we can neither drop the affected rows nor meaningfully impute the column by mean, median, or mode. Hence we drop the "company" column.
- "agent" has **13.69%** missing values. It is the travel agency ID, a unique identifier, so it cannot be imputed by mean, median, or mode. Since the missing values make up more than **13%** of the data, we can't drop those rows either. Therefore the missing values are filled with *0*.
- "country" has almost **0.4%** missing values. Since less than **1%** of the data is missing, we can impute it with the mode.
- "children" has only a few missing values, which we fill with *0* on the assumption that those guests have no children.
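The percentages discussed above can also be computed in one vectorized step. The following is a minimal sketch on a toy frame; the `toy` DataFrame and its values are made up for illustration and are not taken from `hotel_booking.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the booking data (illustrative values only).
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0, np.nan],
    "country": ["PRT", "GBR", None, "PRT"],
    "adults": [2, 2, 1, 2],
})

# isnull().mean() gives the fraction of missing rows per column;
# multiplying by 100 turns it into a percentage.
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Sorting the result puts the columns that need a decision (drop, constant fill, or mode fill) at the top.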

### Dropping "company"

In [None]:
df.drop(columns=["company"], inplace=True)
df.head(1)

### Dealing with "agent"

In [None]:
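# Assign the result back: calling fillna(inplace=True) on a selected column
# is chained assignment, which recent pandas versions warn about.
df["agent"] = df["agent"].fillna(0)
df["agent"].isnull().sum()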

### Dealing with "country"

In [None]:
# mode() returns a Series; [0] picks the most frequent country
df["country"] = df["country"].fillna(df["country"].mode()[0])
df["country"].isnull().sum()

### Dealing with "children"

In [None]:
df["children"] = df["children"].fillna(0)
df["children"].isnull().sum()
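
The three fills above can also be expressed as a single call, since `DataFrame.fillna` accepts a column-to-value mapping. This is only a sketch on a toy frame; `toy`, `fill_values`, and `filled` are hypothetical names and the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns (illustrative values only).
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0],
    "country": ["PRT", None, "PRT"],
    "children": [0.0, np.nan, 2.0],
})

# Constant fill for the ID and count columns, mode for the categorical one.
fill_values = {
    "agent": 0,
    "children": 0,
    "country": toy["country"].mode()[0],
}

# fillna with a dict fills each listed column with its own value
# and returns a new DataFrame, leaving `toy` untouched.
filled = toy.fillna(value=fill_values)

# Total number of missing cells left after the fill.
print(filled.isnull().sum().sum())
```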

## **Feature Encoding Categorical Variables**

In [None]:
cat_features = [feature for feature in df.columns if df[feature].dtype == 'object']
print(f"There are {len(cat_features)} categorical features in the dataset")

In [None]:
cat_features

### Counting Unique Values

In [None]:
for feature in cat_features:
    print(f"{feature}   :   {len(df[feature].unique())} unique values")
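
One detail worth keeping in mind when counting this way: `unique()` treats a missing value as a value of its own, while `nunique()` skips it by default. The sketch below uses a made-up Series to show the difference.

```python
import numpy as np
import pandas as pd

# Illustrative values only; one entry is missing.
s = pd.Series(["BB", "HB", np.nan, "BB"])

count_with_nan = len(s.unique())             # NaN counted as its own value
count_without_nan = s.nunique()              # NaN excluded by default
count_explicit = s.nunique(dropna=False)     # NaN included on request

print(count_with_nan, count_without_nan, count_explicit)
```

For columns that have already been imputed, the two counts agree.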

### Custom Encoding

In [None]:
df["hotel"] = df["hotel"].map({"City Hotel": 0, "Resort Hotel": 1})
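
`Series.map` with a dict returns NaN for any value that is not a key of the dict, so a spelling variant in the data would silently turn into a missing value. The sketch below, on a made-up Series with one deliberate mismatch, shows one way to check for that; the names `hotel`, `mapping`, and `unmapped` are hypothetical.

```python
import pandas as pd

# Illustrative values; the last entry deliberately does not match a key.
hotel = pd.Series(["City Hotel", "Resort Hotel", "city hotel"])
mapping = {"City Hotel": 0, "Resort Hotel": 1}

encoded = hotel.map(mapping)

# Original labels whose encoded value came back as NaN.
unmapped = hotel[encoded.isnull()].unique()
print(unmapped)
```

An empty `unmapped` array means every category was covered by the mapping.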

### Encoding Using `sklearn`

In [None]:
from sklearn.preprocessing import LabelEncoder

# Features to encode
features = ('arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type',
            'assigned_room_type', 'deposit_type', 'customer_type', 'reservation_status', 'reservation_status_date')

# Fit a separate encoder per feature and keep it, so the integer codes
# can later be mapped back to the original labels with inverse_transform().
encoders = {}
for feature in features:
    encoders[feature] = LabelEncoder()
    df[feature] = encoders[feature].fit_transform(df[feature])

df.head(2)
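
`LabelEncoder` assigns codes in sorted order of the labels, so month names end up in alphabetical rather than calendar order. If calendar order matters for later steps, an explicit mapping could be applied to `arrival_date_month` instead of label encoding it. The sketch below assumes the months are stored as full English names; `MONTHS` and `month_to_number` are hypothetical names and the sample values are illustrative.

```python
import pandas as pd

# Assumption: months appear as full English names such as "July".
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Map each month name to its calendar position (January -> 1, ..., December -> 12).
month_to_number = {name: number for number, name in enumerate(MONTHS, start=1)}

months = pd.Series(["July", "August", "January"])  # illustrative values
month_codes = months.map(month_to_number).tolist()
print(month_codes)
```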

In [None]:
cleaned_df = df.copy()

## **Handling Skewness**

## **Correlation between Dependent and Independent Variables**

## **Removing Irrelevant Variables**

[Top](#)