## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 5
---------------------------------------

GOALS:

1. Answer all the questions from the lecture.
2. Practice wrangling data
3. (challenge) do a mini data analysis of your own!

----------------------------------------------------------

This homework has **7 questions** from the lecture and **2 Problems**.

NOTE:
* Questions tend to be short answer or things where you change very minor parts of some given code.
* Problems tend to be more involved, like trying out our methods on a new column.
* Challenge problems are optional! They are intended to challenge you to reach beyond the basics of the class. I hope you will try the challenge problems!

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show
import csv

### Load the raw data:

* Data from two hotels: one resort and one city hotel
* Observations: Each row represents a hotel booking
* Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be canceled [Antonio et al., 2019](https://www.sciencedirect.com/science/article/pii/S2352340918315191#bib5)


[Data Information: Tidy Tuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md)

In [3]:
file_location = '/Users/diegocardenas/Desktop/Data CSV/hotels.csv'
DF_raw_hotels = pd.read_csv(file_location)

In [4]:
show(DF_raw_hotels)



**Q1** Can you figure out how you would focus on multiple columns? We have done this before so see if you can remember the command. It's okay if you need to copy and paste!

In [5]:
DF_raw_hotels.keys()

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

In [6]:
# Your code here
variables = ['hotel', 'arrival_date_month', 'adults','lead_time', 'children']
# Save the selection under a new name so the raw data keeps all of its columns
DF_selected = DF_raw_hotels[variables]
show(DF_selected)

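
Here is a small sketch of selecting multiple columns. It uses a made-up toy data frame (the name `DF_toy` and its values are assumptions for illustration, not the real hotels data) so it can run on its own:

```python
import pandas as pd

# Toy stand-in for the hotels data (values are made up for illustration)
DF_toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel'],
    'adults': [2, 1, 2],
    'children': [0, 0, 1],
    'lead_time': [342, 7, 34],
})

# A list of column names inside the brackets selects those columns, in that order
my_columns = ['hotel', 'lead_time']
DF_selected = DF_toy[my_columns]

# The original data frame still has every column because the result
# was saved under a new name
print(list(DF_selected.columns))
print(list(DF_toy.columns))
```

Saving the selection under a new name is a design choice: it means later cells can still reach columns that were left out here.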


#### Sorting Example - Descending order

In [7]:
my_columns = ['hotel','adults','lead_time',]

DF_lead_time_sorted = DF_raw_hotels[my_columns].sort_values('lead_time', ascending=False)
show(DF_lead_time_sorted)

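
The sketch below shows the same `sort_values` command in both directions, plus sorting by two columns at once. The toy data frame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the hotels data (values are made up for illustration)
DF_toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel', 'Resort Hotel'],
    'adults': [2, 1, 2, 2],
    'lead_time': [342, 7, 34, 7],
})

# Smallest lead time first
DF_ascending = DF_toy.sort_values('lead_time', ascending=True)

# Largest lead time first
DF_descending = DF_toy.sort_values('lead_time', ascending=False)

# Two sort columns: lead_time descending, then hotel name A to Z to break ties
DF_two_keys = DF_toy.sort_values(['lead_time', 'hotel'], ascending=[False, True])

print(DF_ascending)
print(DF_descending)
print(DF_two_keys)
```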


**Q2** Choose another set of variables (columns) that you can explore. Try sorting them ascending and descending.

In [12]:
DF_raw_hotels.keys()

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')

In [14]:
# Your code here:
new_col = ['hotel','arrival_date_week_number', 'reservation_status_date', 'adults', 'children'] 
DF_raw_hotels = pd.read_csv(file_location)
DF_reserve_sorted = DF_raw_hotels[new_col].sort_values('reservation_status_date', ascending=True)
show(DF_reserve_sorted)



**Q3** What would happen if you tried to sort categorical data? Try sorting by the column "arrival_date_month". Do you get an error? Does it sort it? If so how is it sorted? <br><br>

1. When sorting by arrival_date_month, the code runs without an error. Because the column holds text, pandas sorts it alphabetically rather than in calendar order: April comes first and September comes last, with the other months (August, December, February, and so on) in between. The displayed table shows only part of the sorted data, so it can look like only April and September are present even though the other months are still there.

In [16]:
# Your code here:
new_col = ['arrival_date_month']

DF_arrival_sorted = DF_raw_hotels[new_col].sort_values('arrival_date_month', ascending=True)
show(DF_arrival_sorted)

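
Text columns sort alphabetically by default. One possible way to get calendar order instead is an ordered categorical. This is a sketch on a made-up list of months, and `calendar_order` is a name introduced here for illustration:

```python
import pandas as pd

# A few month names in a jumbled order (made up for illustration)
months = pd.Series(['September', 'April', 'July', 'January', 'August'])

# Default behavior: plain text sorts alphabetically
alphabetical = months.sort_values()

# Tell pandas the order we want the categories to have
calendar_order = ['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November', 'December']
months_ordered = pd.Series(
    pd.Categorical(months, categories=calendar_order, ordered=True)
)

# Now sorting follows the calendar instead of the alphabet
by_calendar = months_ordered.sort_values()

print(alphabetical.tolist())
print(by_calendar.tolist())
```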


### Boolean Mask Example - Subselect Data

Here is an example of focusing only on 'Resort Hotel' data in the 'hotel' column

In [17]:
mask = DF_raw_hotels['hotel'] == 'Resort Hotel'
show(DF_raw_hotels[mask])

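
It can help to look at the mask itself before using it. The sketch below uses a made-up toy data frame to show that a mask is just one True/False value per row:

```python
import pandas as pd

# Toy stand-in for the hotels data (values are made up for illustration)
DF_toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel', 'Resort Hotel'],
    'lead_time': [342, 7, 34, 13],
})

# The comparison is checked row by row, giving one True/False per row
mask = DF_toy['hotel'] == 'Resort Hotel'
print(mask.tolist())

# True counts as 1 and False as 0, so the sum is the number of rows kept
print(mask.sum())

# Putting the mask inside the brackets keeps only the True rows
DF_resort = DF_toy[mask]
print(DF_resort)
```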


**Problem 1** Can you create a data frame that:

1. Selects the columns "hotel", "year", and "month"
2. Sorts by year ascending.
3. Masks to only show the month of "August"

In [18]:
my_col = ['hotel', 'arrival_date_year', 'arrival_date_month']
DF_year_sorted = DF_raw_hotels[my_col].sort_values('arrival_date_year',ascending=True)
mask = DF_year_sorted['arrival_date_month'] == 'August'
show(DF_year_sorted[mask])



#### Basic Operators

| Operator | Definition |
|----------|------------|
|<|less than|
|>|greater than|
|<=| less than or equal to|
|>=| greater than or equal to|
|==| exactly equal to|
|!=| not equal to|

#### Advanced Operators

| Operator | Definition |
|----------|------------|
| and | check if two things are both true|
| or | check if one of two things is true|
| in | checks if something is in another thing|
| not | checks if something is false|
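
A few quick checks of the advanced operators in plain Python. This is a minimal sketch with made-up values. Note that Python spells negation with the word `not`, and `!` only appears as part of `!=`:

```python
# "not" flips a True/False value
is_small = 3 < 10
flipped = not is_small

# "not in" checks that something is missing from a list
my_numbers = [3, 6, 18, 42]
seven_missing = 7 not in my_numbers

# "in" also works on text: it checks for a piece of a string
has_date = 'date' in 'arrival_date_year'

# "!=" is the not-equal comparison
different = 'CAT' != 'cat'

print(flipped, seven_missing, has_date, different)
```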

**Q4** Try to say whether each cell will print true or false before running the code:

In [19]:
3<10

True

In [20]:
10<3

False

In [21]:
3<10 and 2<10

True

In [22]:
'cat' == 'cat'

True

In [23]:
'CAT'=='cat'

False

In [24]:
'CAT'!='cat'

True

In [25]:
2<=2

True

In [26]:
2<=2 or 2<1

True

In [27]:
my_numbers = [3,6,18,42]
42 in my_numbers

True

**Q5** Create 5 more logical operator tests and show the results.

In [28]:
# Your code here
5 <= 4

False

In [29]:
# Your code here
num = [38,43,65,23,12]
99 in num

False

In [30]:
# Your code here
"Dog" == 'Dog'

True

In [31]:
# Your code here
num = [38,43,65,23,12]
99 in num or 38 in num

True

In [32]:
# Your code here
num = [38,43,65,23,12]
99 in num and 38 in num

False

### Follow Along - Example Analysis

Now that we know some logical operators we can create fancier boolean masks and get more interesting subsets of our data. Here is a mini analysis of our data using what we know so far. See if you can figure out what each command is doing. 

**Q6** Add your comments explaining the code:


In [None]:
# Add your comments: count how many bookings come from each country
DF_raw_hotels['country'].value_counts()

country
PRT    48590
GBR    12129
FRA    10415
ESP     8568
DEU     7287
       ...  
DJI        1
BWA        1
HND        1
VGB        1
NAM        1
Name: count, Length: 177, dtype: int64

Wow there is a lot of data for PRT (Portugal)!
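
To judge how much "a lot" is, it can help to look at shares instead of raw counts. Here is a sketch using `value_counts(normalize=True)` on a made-up list of country codes (not the real data):

```python
import pandas as pd

# Made-up country codes for illustration
countries = pd.Series(['PRT', 'PRT', 'GBR', 'PRT', 'FRA', 'GBR', 'PRT', 'ESP'])

# Raw counts, largest first
counts = countries.value_counts()

# Same idea, but as a fraction of all rows (the fractions add up to 1)
shares = countries.value_counts(normalize=True)

print(counts)
print(shares)
```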

In [None]:
# Here I am defining the variables (columns) I care about
my_country = 'PRT'
my_columns = ['hotel',
              'arrival_date_year',
              'is_canceled', 
              'arrival_date_month',
              'country']
sort_column = 'arrival_date_year'

# Add your comments: we then make a data frame with only the variables we want and sort it by one of the columns
DF_mini_analysis = DF_raw_hotels[my_columns].sort_values(sort_column).copy()
show(DF_mini_analysis)



In [None]:
# What does this first mask do?
# Your comments: The first mask gets only the country we are looking for
mask_country = DF_mini_analysis['country']==my_country
DF_mini_analysis_masked = DF_mini_analysis[mask_country]
show(DF_mini_analysis_masked)



In [None]:
# What does this second mask do?
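# Your comments: The second mask keeps only rows with an arrival year after 2015.
# Applying it to the already-masked data works like an if statement with an "and":
# the country must be PRT and the year must be greater than 2015.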
mask_years = DF_mini_analysis_masked['arrival_date_year']>2015
DF_mini_analysis_masked = DF_mini_analysis_masked[mask_years] 
# Notice that I wrote over my data in the command above

show(DF_mini_analysis_masked)

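
The two masks above are applied one after the other. As a sketch of an alternative, pandas also lets you combine masks in one step with `&` (and), `|` (or), and `~` (not). The toy data frame below is made up for illustration:

```python
import pandas as pd

# Toy stand-in for the hotels data (values are made up for illustration)
DF_toy = pd.DataFrame({
    'country': ['PRT', 'GBR', 'PRT', 'PRT'],
    'arrival_date_year': [2015, 2016, 2016, 2017],
})

# Two separate masks built from the same data frame
mask_country = DF_toy['country'] == 'PRT'
mask_years = DF_toy['arrival_date_year'] > 2015

# Two-step version: mask, then mask the result again
DF_two_steps = DF_toy[mask_country]
DF_two_steps = DF_two_steps[DF_two_steps['arrival_date_year'] > 2015]

# One-step version: & keeps rows where BOTH masks are True
DF_one_step = DF_toy[mask_country & mask_years]

# If you write the comparisons inline, each one needs its own parentheses
DF_inline = DF_toy[(DF_toy['country'] == 'PRT') & (DF_toy['arrival_date_year'] > 2015)]

print(DF_one_step)
```

For data frames the symbols `&` and `|` are used instead of the words `and` and `or`, because the check has to happen row by row.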


In [37]:
fig = px.histogram(DF_mini_analysis_masked,
                  x='arrival_date_month',
                  color='hotel',
                  facet_col='is_canceled',
                  facet_col_wrap=1)

fig.update_layout(title='Number of Hotels Booked Each Month in Portugal',
                  title_x=0.5,
                  xaxis_title="Month",
                  autosize=False,
                  width=800,
                  height=500)
fig.show()

**Q7** What exactly is this histogram showing and what is excluded based on the analysis we did above? Is there anything interesting about the graph? Any questions raised?

**Question 7 Answer**<br>
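The histogram shows the number of bookings per arrival month for guests from Portugal (country PRT) arriving after 2015, split into one panel for each value of is_canceled and colored by hotel. Bookings from every other country and all 2015 arrivals are excluded by the two masks. The interesting thing we see is that there are a lot more cancellations for the City Hotel than for the Resort Hotel. Why are there so many cancellations at the City Hotel?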
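
One possible way to dig into a question like this is to compare cancellation rates instead of raw counts, since a hotel with more bookings will naturally have more cancellations. This sketch uses made-up numbers (not the real data) where the counts differ but the rates are the same:

```python
import pandas as pd

# Made-up bookings: 6 for the City Hotel, 2 for the Resort Hotel
DF_toy = pd.DataFrame({
    'hotel': ['City Hotel'] * 6 + ['Resort Hotel'] * 2,
    'is_canceled': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Number of canceled bookings per hotel (is_canceled is 1 for a cancellation)
n_canceled = DF_toy.groupby('hotel')['is_canceled'].sum()

# Fraction of bookings canceled per hotel (the mean of a 0/1 column)
cancel_rate = DF_toy.groupby('hotel')['is_canceled'].mean()

print(n_canceled)
print(cancel_rate)
```

In this toy example the City Hotel has three times as many cancellations, but both hotels cancel half of their bookings. Whether the real data behaves this way would need to be checked.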

**Problem 2** Now it's your turn to do a mini analysis. Please do the following:

1. Choose a subset of columns that you are interested in and look just at those columns.
2. Make at least one mask that will remove some of the data (e.g. look only at one categorical value OR one or more numerical values).
3. Sort the data using one of the columns.
4. Make a histogram or bar graph of your final data set.
5. Describe in words what you see in your graph, or if there is anything interesting there.

In [47]:
# Your code here - you will probably need more than one cell
my_col = ['hotel',
          'booking_changes', 
          'previous_bookings_not_canceled', 
          'previous_cancellations',
          'country',
          'reserved_room_type']

sort_col = 'country'

DF_analysis = DF_raw_hotels[my_col].sort_values(sort_col).copy()
show(DF_analysis)



In [52]:
mask_book_changes = DF_analysis['booking_changes'] > 0
DF_analysis_masked = DF_analysis[mask_book_changes]
show(DF_analysis_masked)



In [None]:
# Histogram of how many changes were made, for bookings with at least one change
fig = px.histogram(DF_analysis_masked,
                  x='booking_changes',
                  color='hotel')

fig.update_layout(title='Number of Booking Changes by Hotel',
                  title_x=0.5,
                  xaxis_title="# of Booking Changes")
fig.show()

# End Of Homework

### Everything below here is extra - just so you have a copy of the code I showed in lecture.

### Excluding variables - removing one column (Extra)

There are a few ways to get rid of columns. The first would be to write down ALL the columns and leave out one. Let's say we want to leave out information about the "agent" variable.

In [41]:
# First get all of them
my_columns = list(DF_raw_hotels.keys())
# Then remove the one you don't want
my_columns.remove('agent')
# Then get the new data frame
DF_raw_hotels[my_columns]

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,booking_changes,deposit_type,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,3,No Deposit,,0,Transient,0.00,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,4,No Deposit,,0,Transient,0.00,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,0,No Deposit,,0,Transient,75.00,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,0,No Deposit,,0,Transient,75.00,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,0,No Deposit,,0,Transient,98.00,0,1,Check-Out,2015-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119385,City Hotel,0,23,2017,August,35,30,2,5,2,...,0,No Deposit,,0,Transient,96.14,0,0,Check-Out,2017-09-06
119386,City Hotel,0,102,2017,August,35,31,2,5,3,...,0,No Deposit,,0,Transient,225.43,0,2,Check-Out,2017-09-07
119387,City Hotel,0,34,2017,August,35,31,2,5,2,...,0,No Deposit,,0,Transient,157.71,0,4,Check-Out,2017-09-07
119388,City Hotel,0,109,2017,August,35,31,2,5,2,...,0,No Deposit,,0,Transient,104.40,0,0,Check-Out,2017-09-07
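
Another way to leave out a column is the pandas `drop` command. Here is a minimal sketch on a made-up toy data frame (the values are for illustration only):

```python
import pandas as pd

# Toy stand-in for the hotels data (values are made up for illustration)
DF_toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel'],
    'agent': [240.0, 9.0],
    'adr': [75.0, 96.14],
})

# drop returns a new data frame without the listed columns
DF_no_agent = DF_toy.drop(columns=['agent'])

print(list(DF_no_agent.columns))
# The original still has the 'agent' column
print(list(DF_toy.columns))
```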


### Fancy ways to get columns (Extra - ADVANCED)

There are fancy ways to get a list of columns that you might be interested in.

#### Starts with

You can use the command **startswith()** to check if a variable starts with a word or part of a word. Here is code to get only columns that start with the word "arrival":

In [42]:
# Get the list of all the columns
columns_list = list(DF_raw_hotels.keys())
# Get true false values using the startswith() command
column_mask = [column.startswith('arrival') for column in columns_list]
# Mask the column names to get just the ones you want and make a list
my_columns = list(DF_raw_hotels.keys()[column_mask])
print(my_columns)

['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month']


#### Ends with

You can use the command **endswith()** to check if a variable ends with a word or part of a word. Here is code to get only columns that end with the word "type":

In [43]:
# Get the list of all the columns
columns_list = list(DF_raw_hotels.keys())
# Get true false values using the endswith() command
column_mask = [column.endswith('type') for column in columns_list]
# Mask the column names to get just the ones you want and make a list
my_columns = list(DF_raw_hotels.keys()[column_mask])
print(my_columns)

['reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type']


#### IN

You can use the Python keyword **in** to check if a variable contains a word or part of a word. Here is code to get only columns that contain the word "date":

In [44]:
# Get the list of all the columns
columns_list = list(DF_raw_hotels.keys())
# Get true false values using the "in" check
column_mask = ["date" in column for column in columns_list]
# Mask the column names to get just the ones you want and make a list
my_columns = list(DF_raw_hotels.keys()[column_mask])
print(my_columns)

['arrival_date_year', 'arrival_date_month', 'arrival_date_week_number', 'arrival_date_day_of_month', 'reservation_status_date']
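
The same three searches can also be sketched with pandas' built-in string tools instead of a list comprehension. The toy data frame below has made-up values and only a handful of the real column names:

```python
import pandas as pd

# Toy data frame with a few of the hotel column names (values are made up)
DF_toy = pd.DataFrame({
    'hotel': ['Resort Hotel'],
    'arrival_date_year': [2015],
    'arrival_date_month': ['July'],
    'deposit_type': ['No Deposit'],
    'reservation_status_date': ['2015-07-01'],
})

# .str lets you run string commands on every column name at once
starts = list(DF_toy.columns[DF_toy.columns.str.startswith('arrival')])
ends = list(DF_toy.columns[DF_toy.columns.str.endswith('type')])

# filter(like=...) keeps the columns whose names contain the given text
contains = list(DF_toy.filter(like='date').columns)

print(starts)
print(ends)
print(contains)
```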
