# Movie classification
## UFRJ 2022.2 - Introduction to Machine Learning

- Rafael: SVM and Naive Bayes

## Libraries

In [1]:
import pandas as pd
import numpy as np

## Dataset
The "Booking_ID" column was dropped because it only holds the unique identifier of each reservation, which is redundant with the index of the DataFrame we will work with.
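One way to confirm that an identifier column is redundant before dropping it is to check that every value appears exactly once. The sketch below uses a small hypothetical frame; the `toy` data and its ID strings are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the reservations table; values are illustrative only.
toy = pd.DataFrame({
    "Booking_ID": ["ID_A", "ID_B", "ID_C"],
    "lead_time": [224, 5, 1],
})

# An identifier adds nothing beyond the index when it has one distinct value per row.
id_is_unique = toy["Booking_ID"].is_unique

# Dropping it leaves the default integer index as the row identifier.
toy_no_id = toy.drop(columns="Booking_ID")
print(id_is_unique, list(toy_no_id.columns))
```

If `is_unique` were `False`, the column would carry grouping information and dropping it would deserve a second look.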

In [2]:
df = pd.read_csv('Dados/Hotel Reservations.csv')
df.drop('Booking_ID', axis = 1, inplace = True)
df.head()

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled


## Data cleaning

Based on the slides from Lecture 1 - Datasets and Outliers

### Checking for null values

In [3]:
df.isnull().sum()

no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64

### Checking for duplicate rows

In [4]:
df.duplicated().sum()

10275

In [5]:
df.drop_duplicates(inplace = True)
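As a minimal sketch of what these two cells do, the toy frame below (hypothetical values) shows that `duplicated()` flags rows identical to an earlier row and that `drop_duplicates()` keeps the first occurrence while preserving the original row labels. That behavior is why an index can end at a label larger than the number of remaining rows.

```python
import pandas as pd

# Hypothetical rows; the third row is an exact copy of the second.
toy = pd.DataFrame({
    "lead_time": [224, 5, 5, 48],
    "booking_status": ["Not_Canceled", "Canceled", "Canceled", "Canceled"],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduped = toy.drop_duplicates()             # keeps the first occurrence by default

# The surviving rows keep their original labels, so the index has gaps.
print(n_duplicates, deduped.index.tolist())

# Optional: renumber the rows from 0 if a contiguous index is preferred.
deduped_reset = deduped.reset_index(drop=True)
```

Whether to reset the index is a design choice; the notebook keeps the original labels, which makes it possible to trace a row back to the raw file.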

### Checking column types

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26000 entries, 0 to 36273
Data columns (total 18 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          26000 non-null  int64  
 1   no_of_children                        26000 non-null  int64  
 2   no_of_weekend_nights                  26000 non-null  int64  
 3   no_of_week_nights                     26000 non-null  int64  
 4   type_of_meal_plan                     26000 non-null  object 
 5   required_car_parking_space            26000 non-null  int64  
 6   room_type_reserved                    26000 non-null  object 
 7   lead_time                             26000 non-null  int64  
 8   arrival_year                          26000 non-null  int64  
 9   arrival_month                         26000 non-null  int64  
 10  arrival_date                          26000 non-null  int64  
 11  market_segment_

We need to convert the "booking_status" column from "object" to integer values so that the classification methods can be applied. Since the "type_of_meal_plan", "room_type_reserved", and "market_segment_type" columns may play an important role in our classification, we will apply the same transformation to them.
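The columns that need encoding can also be listed programmatically instead of being read off the `df.info()` output. The sketch below assumes a small hypothetical frame with a subset of the dataset's column names, and selects every non-numeric column.

```python
import pandas as pd

# Hypothetical two-row frame using a few of the dataset's column names.
toy = pd.DataFrame({
    "no_of_adults": [2, 1],
    "type_of_meal_plan": ["Meal Plan 1", "Not Selected"],
    "avg_price_per_room": [65.0, 60.0],
    "booking_status": ["Not_Canceled", "Canceled"],
})

# Non-numeric columns are the ones that must be encoded before fitting a classifier.
columns_to_encode = toy.select_dtypes(exclude="number").columns.tolist()
print(columns_to_encode)
```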

#### Column booking_status

In [7]:
df['booking_status'].unique()

array(['Not_Canceled', 'Canceled'], dtype=object)

Applying the transformation:

- Canceled $\rightarrow$ 1
- Not_Canceled $\rightarrow$ 0

In [8]:
df['booking_status'] = np.where(df['booking_status'] == 'Canceled', 1, 0)
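For reference, the same binary rule can be written by casting the boolean mask to integers. The sketch below compares both forms on a hypothetical three-element series. Note that with either form any label other than `'Canceled'` becomes 0, so an unexpected label would not raise an error.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the target column.
status = pd.Series(["Not_Canceled", "Canceled", "Canceled"], name="booking_status")

# Same rule as in the cell above: 'Canceled' -> 1, everything else -> 0.
with_where = np.where(status == "Canceled", 1, 0)

# Equivalent formulation: cast the boolean mask directly to integers.
with_astype = (status == "Canceled").astype(int)

print(with_where.tolist(), with_astype.tolist())
```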

#### Column type_of_meal_plan

In [9]:
df['type_of_meal_plan'].unique()

array(['Meal Plan 1', 'Not Selected', 'Meal Plan 2', 'Meal Plan 3'],
      dtype=object)

Applying the transformation:

- Meal Plan 1 $\rightarrow$ 0
- Not Selected $\rightarrow$ 1
- Meal Plan 2 $\rightarrow$ 2
- Meal Plan 3 $\rightarrow$ 3

In [10]:
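# tomp = type of meal plan
# Labels in order of first appearance, paired with the codes 0, 1, 2, ...
tomp_original_values = df['type_of_meal_plan'].unique().tolist()
tomp_int_values = list(range(len(tomp_original_values)))

In [11]:
# Encode only this column; a DataFrame-wide replace would also rewrite
# matching strings in any other column.
df['type_of_meal_plan'] = df['type_of_meal_plan'].map(dict(zip(tomp_original_values, tomp_int_values)))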
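The encoding above assigns codes in order of first appearance. As a hedged aside, `pd.factorize` follows the same ordering rule and returns the codes and the labels in a single call. The series below is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of the meal plan column.
meal_plan = pd.Series(
    ["Meal Plan 1", "Not Selected", "Meal Plan 1", "Meal Plan 2", "Meal Plan 3"],
    name="type_of_meal_plan",
)

# factorize assigns integer codes in order of first appearance,
# which matches pairing unique() with range(len(...)).
codes, labels = pd.factorize(meal_plan)

print(codes.tolist())
print(list(labels))
```

Keeping `labels` around makes it possible to translate a code back to its original text later, since `labels[code]` is the label for that code.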

#### Column room_type_reserved

In [12]:
df['room_type_reserved'].unique()

array(['Room_Type 1', 'Room_Type 4', 'Room_Type 2', 'Room_Type 6',
       'Room_Type 5', 'Room_Type 7', 'Room_Type 3'], dtype=object)

Applying the transformation:

- Room_Type 1 $\rightarrow$ 0
- Room_Type 4 $\rightarrow$ 1
- Room_Type 2 $\rightarrow$ 2
- Room_Type 6 $\rightarrow$ 3
- Room_Type 5 $\rightarrow$ 4
- Room_Type 7 $\rightarrow$ 5
- Room_Type 3 $\rightarrow$ 6

In [13]:
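# rtp = room type reserved
# Labels in order of first appearance, paired with the codes 0, 1, 2, ...
rtp_original_values = df['room_type_reserved'].unique().tolist()
rtp_int_values = list(range(len(rtp_original_values)))

In [14]:
# Encode only this column; a DataFrame-wide replace would also rewrite
# matching strings in any other column.
df['room_type_reserved'] = df['room_type_reserved'].map(dict(zip(rtp_original_values, rtp_int_values)))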

#### Column market_segment_type

In [15]:
df['market_segment_type'].unique()

array(['Offline', 'Online', 'Corporate', 'Aviation', 'Complementary'],
      dtype=object)

Applying the transformation:

- Offline $\rightarrow$ 0
- Online $\rightarrow$ 1
- Corporate $\rightarrow$ 2
- Aviation $\rightarrow$ 3
- Complementary $\rightarrow$ 4

In [16]:
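# mst = market segment type
# Labels in order of first appearance, paired with the codes 0, 1, 2, ...
mst_original_values = df['market_segment_type'].unique().tolist()
mst_int_values = list(range(len(mst_original_values)))

In [17]:
# Encode only this column; a DataFrame-wide replace would also rewrite
# matching strings in any other column.
df['market_segment_type'] = df['market_segment_type'].map(dict(zip(mst_original_values, mst_int_values)))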
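Since the same three steps are repeated for each categorical column, they could be gathered into one function. The sketch below is only an illustration: `encode_by_first_appearance` is a hypothetical helper, not part of the notebook or of pandas, and the frame is a made-up sample. It also returns the mapping so that codes can be decoded when reporting results.

```python
import pandas as pd

def encode_by_first_appearance(frame, column):
    """Replace the labels in `column` with 0, 1, 2, ... in order of first
    appearance and return the label -> code mapping (hypothetical helper)."""
    labels = frame[column].unique().tolist()
    mapping = {label: code for code, label in enumerate(labels)}
    frame[column] = frame[column].map(mapping)
    return mapping

# Hypothetical sample with two of the dataset's categorical columns.
toy = pd.DataFrame({
    "room_type_reserved": ["Room_Type 1", "Room_Type 4", "Room_Type 1"],
    "market_segment_type": ["Offline", "Online", "Online"],
})

# Encode every column in place and remember each mapping.
mappings = {col: encode_by_first_appearance(toy, col) for col in toy.columns}

# Inverting a mapping recovers the original labels from the integer codes.
inverse_market = {code: label for label, code in mappings["market_segment_type"].items()}
decoded_market = toy["market_segment_type"].map(inverse_market).tolist()
print(toy["room_type_reserved"].tolist(), decoded_market)
```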

In [18]:
df

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,0,0,0,224,2017,10,2,0,0,0,0,65.00,0,0
1,2,0,2,3,1,0,0,5,2018,11,6,1,0,0,0,106.68,1,0
2,1,0,2,1,0,0,0,1,2018,2,28,1,0,0,0,60.00,0,1
3,2,0,0,2,0,0,0,211,2018,5,20,1,0,0,0,100.00,0,1
4,2,0,1,1,1,0,0,48,2018,4,11,1,0,0,0,94.50,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36269,2,2,0,1,0,0,3,0,2018,10,6,1,0,0,0,216.00,0,1
36270,3,0,2,6,0,0,1,85,2018,8,3,1,0,0,0,167.80,1,0
36271,2,0,1,3,0,0,0,228,2018,10,17,1,0,0,0,90.95,2,1
36272,2,0,2,6,0,0,0,148,2018,7,1,1,0,0,0,98.39,2,0
