# Connect and Insert Data to MySQL Server with Python

# Introduction 

As more and more data becomes available, storing data in a cloud database and retrieving it from there, rather than from a local file, has become the norm at start-ups and large corporations alike. Services such as Google Cloud Platform and Microsoft Azure offer cloud storage for large datasets. Most current database systems use SQL to store and retrieve this data. Therefore, knowing how to write SQL commands and queries is one of the top skills required in data-related jobs, especially for data analysts and data scientists. The following graph shows the results of the [2020 Data Science and Machine Learning Survey](https://research.aimultiple.com/data-science-tools/).

<center><img src="asset/survey.png"></center>

To learn how to run SQL queries, we will use one of the most common database management systems: MySQL. In this notebook, I will guide you through the following tasks with MySQL:

- Create a database on a MySQL server
- Create multiple tables
- Insert data into SQL tables
- Write queries to retrieve data from the SQL database

The full data analysis, in which we will gain insight from the data, will be done in a separate notebook.

If you are not familiar with SQL, you can get a quick introduction from [this website](https://www.learnsqlonline.org/): just read the welcome page, or try some of the exercises and come back here later.

# Set Up MySQL Server

There are many options for creating a database: you can use a common cloud service such as [Google Cloud Platform](https://help.appsheet.com/en/articles/3627169-create-a-mysql-database-hosted-in-google-cloud) or [Microsoft Azure](https://docs.microsoft.com/en-us/azure/mysql/flexible-server/quickstart-create-server-portal), or set up a local MySQL server on your own device. However, setting up a database with any of these can take quite a while. Since our goal is to focus on preparing data and storing it in MySQL, we will use a free hosting website, [db4free](https://db4free.net/).

![](asset/db4free.png)

This website helps us set up a small, free MySQL server to practice with. You just need to [register](https://db4free.net/signup.php) with your email and enter the following information:

- The name of your MySQL database
- Your username to login
- Your password to login
- Your email address for validation

# Library

The following libraries are required throughout this notebook.

In [1]:
# Data wrangling
import pandas as pd
import numpy as np

# Regular expression to transform string
import re

# Computation time
import time

# Connect to mysql server
import mysql.connector

# Data

We will use Singapore room listing data from [Airbnb](http://insideairbnb.com/get-the-data.html). The dataset contains information about the rooms, or listings, posted on the Airbnb site in Singapore.

The Airbnb data contains 3 separate datasets for each region:

- Listing: Detailed information about each listing or room posted on Airbnb. One host can have multiple listings.
- Calendar: Detailed daily availability and price for each listing
- Reviews: Detailed customer reviews of Airbnb listings

For detailed information about each column, you can check the descriptions in [this spreadsheet](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?usp=sharing).

Each dataset can be further broken down into several parts. For example, the listing data has several columns that store information about the host rather than the room or listing. In practice, we can reduce the storage space the database requires by separating the listing data from the host data, since a single host can have multiple listings.

To illustrate the relations between the tables, we will draw an **Entity Relationship Diagram (ERD)**. An ERD is often used to help data engineers design a database and show the relations between its tables.

![ERD of Airbnb](asset/ERD_Airbnb.png)

Each table can have a column that contains a unique ID to identify each row. This column is called the **Primary Key (PK)**. For example, in the host_info table, each host should be stored only once, with no duplicates, and is identified by the `host_id` column. A table can also have a column called a **Foreign Key (FK)**, which refers to the primary key of another table. For example, the listing table has `id` as the primary key and `host_id` as a foreign key. This means that we can join the listing table with the host_info table by matching `host_id` in the listing table with `host_id` in the host_info table.

The relation, or cardinality, between tables is illustrated by the symbols at both ends of the arrow. There are several kinds of cardinality in an ERD; you can find a detailed explanation [here](https://www.youtube.com/watch?v=QpdhBUYk7Kk). A host can have zero or many listings, so the cardinality on the listing side is drawn as *zero or many*. A room listing, on the other hand, is owned by exactly one host; no listing is owned by several hosts. Therefore, the cardinality on the host_info side is drawn as *one (and only one)*.

<center><img src="asset/erd_cardinality.png" style="width: 400px;"/></center>
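
To make the key relationship concrete, here is a small sketch using two toy dataframes. The rows are made up for illustration and do not come from the Airbnb data. It checks that `host_id` is unique in the host table, as a primary key must be, and then joins the two tables through the foreign key. The `validate="many_to_one"` argument asks pandas to raise an error if a listing could match more than one host.

```python
import pandas as pd

# Toy tables; the values are illustrative only
host_info = pd.DataFrame({
    "host_id": [1, 2, 3],
    "host_name": ["Ann", "Budi", "Chen"],
})
listing = pd.DataFrame({
    "id": [10, 11, 12],
    "host_id": [1, 1, 2],   # host 1 owns two listings, host 3 owns none
})

# A primary key must identify each row uniquely
assert host_info["host_id"].is_unique

# Join each listing to its single host through the foreign key
joined = listing.merge(host_info, on="host_id", how="left", validate="many_to_one")
print(joined)

# Count listings per host, keeping hosts that have zero listings
listings_per_host = (
    host_info["host_id"].map(listing["host_id"].value_counts()).fillna(0).astype(int)
)
print(dict(zip(host_info["host_name"], listings_per_host)))
```

The last step mirrors the *zero or many* cardinality: one host has two listings, one has a single listing, and one has none.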

## Host
### Processing Host Table

We will start by creating and cleaning a host table from the listing data. Let's begin by importing the listing data.

In [2]:
df_listing = pd.read_csv("data/listings.csv")

print("Number of rows: %i" %df_listing.shape[0])
print("Number of columns: %i" %df_listing.shape[1])

Number of rows: 4388
Number of columns: 74


Let's check the information from the dataset.

In [3]:
df_listing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4388 entries, 0 to 4387
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            4388 non-null   int64  
 1   listing_url                                   4388 non-null   object 
 2   scrape_id                                     4388 non-null   int64  
 3   last_scraped                                  4388 non-null   object 
 4   name                                          4388 non-null   object 
 5   description                                   4241 non-null   object 
 6   neighborhood_overview                         2787 non-null   object 
 7   picture_url                                   4388 non-null   object 
 8   host_id                                       4388 non-null   int64  
 9   host_url                                      4388 non-null   o

The listing dataset contains information about each room listing posted on the Airbnb website by a host. Several columns contain missing values, but we will deal with them later.

A single host is identified by `host_id` and may have multiple rooms in the listing data. Let's check this hypothesis.

In [4]:
df_listing.value_counts('host_id')

host_id
138649185    157
66406177     142
2413412      137
238891646    116
8948251      109
            ... 
47987680       1
48018398       1
48108016       1
48110833       1
393244617      1
Length: 1205, dtype: int64

As we can see from the output above, several hosts have more than a hundred listings. For efficient storage in the database, we will separate the information about the host, such as `host_id`, `host_url`, etc., from the information about the listing.

In [5]:
# Collect columns related to host 
df_host1 = df_listing.loc[:, 'host_id':'host_identity_verified']
df_host2 = df_listing.loc[:, 'calculated_host_listings_count':'calculated_host_listings_count_shared_rooms']

df_host = pd.concat([df_host1, df_host2], axis = 1)

df_host.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4388 entries, 0 to 4387
Data columns (total 22 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   host_id                                       4388 non-null   int64  
 1   host_url                                      4388 non-null   object 
 2   host_name                                     4328 non-null   object 
 3   host_since                                    4328 non-null   object 
 4   host_location                                 4327 non-null   object 
 5   host_about                                    2985 non-null   object 
 6   host_response_time                            3707 non-null   object 
 7   host_response_rate                            3707 non-null   object 
 8   host_acceptance_rate                          3394 non-null   object 
 9   host_is_superhost                             4328 non-null   o

Now we have 22 host columns; the rest are columns related to the individual listing.

Each host should appear only once, so we will remove the duplicated hosts.

In [6]:
# Keep the first row of each host; copy so that later assignments modify an independent dataframe
df_host = df_host.drop_duplicates('host_id').copy()

print("Number of rows: %i" %df_host.shape[0])

Number of rows: 1205


Next, we prepare the dataset so that each column gets a proper data type in the database. You may have noticed that some columns should have a boolean (logical) data type, such as `host_is_superhost`, `host_has_profile_pic`, and `host_identity_verified`. They contain the string *t* if the value is *True* and *f* if the value is *False*. We will transform them to the proper data type.

In [7]:
for col in ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified']:

    # Map "t" to True and "f" to False; any other value becomes NaN
    df_host[col] = df_host[col].map({'t': True, 'f': False})

The next thing I want to clean up is the `host_verifications` column, which is in a list format like the following sample.

In [8]:
df_host['host_verifications'][0:5]

0    ['email', 'phone', 'facebook', 'reviews', 'jum...
1    ['email', 'phone', 'facebook', 'reviews', 'off...
3    ['email', 'phone', 'reviews', 'manual_offline'...
7    ['email', 'phone', 'reviews', 'jumio', 'offlin...
9    ['email', 'phone', 'facebook', 'reviews', 'off...
Name: host_verifications, dtype: object

We will clean the values so that they contain plain text. For example, the first row will become *email, phone, facebook, ...*.

In [9]:
# Remove square brackets and apostrophes (raw string avoids invalid escape sequences)
df_host['host_verifications'] = [re.sub(r"['\[\]]", '', x) for x in df_host['host_verifications']]

df_host['host_verifications'][0:5]

0    email, phone, facebook, reviews, jumio, offlin...
1    email, phone, facebook, reviews, offline_gover...
3    email, phone, reviews, manual_offline, work_email
7    email, phone, reviews, jumio, offline_governme...
9    email, phone, facebook, reviews, offline_gover...
Name: host_verifications, dtype: object
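
As an alternative to the regular expression, the sketch below parses each value as a Python list and joins its items. It assumes every value is a valid list literal such as `"['email', 'phone']"`; `verifications_to_text` is a hypothetical helper name, not part of the original notebook.

```python
import ast

def verifications_to_text(raw):
    """Turn "['email', 'phone']" into "email, phone"."""
    items = ast.literal_eval(raw)   # safely parse the list literal
    return ", ".join(items)

sample = "['email', 'phone', 'work_email']"
print(verifications_to_text(sample))
```

Unlike the character-stripping approach, this would keep an apostrophe that is part of an item. It would raise an error on a missing value, so missing entries would need to be handled before applying it to a column.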

### Connect to MySQL Server

After you have set up the MySQL server, you can connect to it from Python using **mysql.connector**. You need the following information about the server:

- host name
- port
- username
- password
- database name

In [10]:
mydb = mysql.connector.connect(
    host = "db4free.net",
    port = 3306,
    user = "*****",
    password = "*****",
    database = "*****"
)

print(mydb)

<mysql.connector.connection_cext.CMySQLConnection object at 0x7f18b7e29810>
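
If you prefer not to type credentials into the notebook, one option is to read them from environment variables. The sketch below only builds the dictionary of connection settings; the variable names (`MYSQL_HOST`, `MYSQL_USER`, and so on) and the helper `load_db_config` are assumptions made for this example.

```python
import os

def load_db_config(env=os.environ):
    """Collect connection settings from environment variables."""
    return {
        "host": env.get("MYSQL_HOST", "db4free.net"),
        "port": int(env.get("MYSQL_PORT", "3306")),
        "user": env["MYSQL_USER"],
        "password": env["MYSQL_PASSWORD"],
        "database": env["MYSQL_DATABASE"],
    }

# Example with an explicit mapping instead of the real environment
config = load_db_config({
    "MYSQL_USER": "example_user",
    "MYSQL_PASSWORD": "example_password",
    "MYSQL_DATABASE": "example_db",
})
print(config["host"], config["port"])
```

With the variables set in your environment, the connection could then be opened with `mysql.connector.connect(**load_db_config())`, which passes the same keyword arguments as the cell above.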


You can run a simple query and get the result back directly as a pandas dataframe. To see whether your database contains any tables, you can run the query **SHOW TABLES** as in the following code. This returns a dataframe with the names of all tables in your database.

In [11]:
pd.read_sql("SHOW TABLES", mydb)

| | Tables_in_airbnb_learning |
|---|---|
| 0 | calendar |
| 1 | host_info |
| 2 | listing |
| 3 | review |
| 4 | reviewer |


### Create Host Table

We will start by creating a table in the database. Since none of the numeric columns in the host data hold decimal values, we will store all of them as integers. Next, we need to choose a suitable maximum number of characters for each string column, so we will check the maximum length of each string column in the host dataset.

In [12]:
# Function to check maximum character length
def check_char(data):
    print("Maximum Character Length")
    
    for col in data.columns:
        char_length = list(map(lambda x: len( str(x) ), data[col]))
        print(col + ": %i" %np.max(char_length))

In [13]:
check_char(df_host.select_dtypes('object'))

Maximum Character Length
host_url: 43
host_name: 35
host_since: 10
host_location: 183
host_about: 2858
host_response_time: 18
host_response_rate: 4
host_acceptance_rate: 4
host_is_superhost: 5
host_thumbnail_url: 106
host_picture_url: 109
host_neighbourhood: 18
host_verifications: 134
host_has_profile_pic: 5
host_identity_verified: 5


Now we can create a table named `host_info` using the following data types:

- INT: integer, for IDs, quantities, or other non-decimal numeric values
- VARCHAR(n): string with a maximum length of n characters
- DATE: date (in YYYY-MM-DD format)
- BOOLEAN: logical (TRUE or FALSE); MySQL stores it as TINYINT(1), so the values come back as 1 or 0

The `host_id` column will be the primary key.

Make sure the number of characters (n) in VARCHAR is larger than the longest value in your data. For example, the maximum length of `host_location` is 183, so you can create a VARCHAR(200) or VARCHAR(500) just to be safe.
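
One way to pick n consistently is to add some headroom to the observed maximum and round up. The sketch below is only a heuristic: the helper `suggest_varchar`, the 1.5 headroom factor, and the step of 50 are assumptions for illustration, and the sizes it returns differ from the ones chosen by hand in the table definition.

```python
import math

def suggest_varchar(max_length, headroom=1.5, step=50):
    """Round max_length * headroom up to the next multiple of step."""
    padded = max_length * headroom
    size = max(step, math.ceil(padded / step) * step)
    return "VARCHAR(%i)" % size

# Maximum lengths taken from the check_char output above
for name, length in [("host_url", 43), ("host_location", 183), ("host_about", 2858)]:
    print(name, suggest_varchar(length))
```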

For more detailed information about the data types supported by MySQL, you can check the [manual](https://dev.mysql.com/doc/refman/8.0/en/data-types.html).

The query for creating a table has the following template:

*CREATE TABLE table_name (column_name_1 data_type_1, column_name_2 data_type_2, ....)*

In [14]:
query = """
CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since DATE,
host_location VARCHAR(500),
host_about VARCHAR(5000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)
"""

query

'\nCREATE TABLE host_info(\nhost_id INT,\nhost_url VARCHAR(50),\nhost_name VARCHAR(100),\nhost_since DATE,\nhost_location VARCHAR(500),\nhost_about VARCHAR(5000),\nhost_response_time VARCHAR(50),\nhost_response_rate VARCHAR(50),\nhost_acceptance_rate VARCHAR(50),\nhost_is_superhost BOOLEAN,\nhost_thumbnail_url VARCHAR(500),\nhost_picture_url VARCHAR(500),\nhost_neighbourhood VARCHAR(50),\nhost_listings_count INT,\nhost_total_listings_count INT,\nhost_verifications VARCHAR(500),\nhost_has_profile_pic BOOLEAN,\nhost_identity_verified BOOLEAN,\ncalculated_host_listings_count INT,\ncalculated_host_listings_count_entire_homes INT,\ncalculated_host_listings_count_private_rooms INT,\ncalculated_host_listings_count_shared_rooms INT,\nPRIMARY KEY(host_id)\n)\n'

After you have created the SQL query, you can execute it by creating a cursor from your connection object (`mydb`).
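
A minimal sketch of that cursor pattern is shown below. To keep it runnable without a MySQL server, it uses Python's built-in `sqlite3` module as a stand-in connection; the helper `run_statement` and the `demo_host` table are hypothetical names for this example. The cursor calls themselves are the same ones used on `mydb` later in this notebook.

```python
import sqlite3

def run_statement(connection, statement):
    """Execute one SQL statement through a cursor and commit it."""
    cursor = connection.cursor()
    try:
        cursor.execute(statement)
        connection.commit()
    finally:
        cursor.close()

# Stand-in connection so the sketch runs without a MySQL server
demo_db = sqlite3.connect(":memory:")
run_statement(
    demo_db,
    "CREATE TABLE demo_host (host_id INT, host_name VARCHAR(100), PRIMARY KEY(host_id))",
)

# List the tables that now exist in the stand-in database
tables = demo_db.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)
```

With the MySQL connection from earlier, the equivalent call would be `run_statement(mydb, query)`.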

You can use the `read_sql()` function to check the description of your table on the MySQL server. The output contains the following information:

- Field: name of the column
- Type: data type
- Null: whether missing values are allowed
- Key: whether the column is indexed; PRI marks a primary key, and a foreign key column normally shows up as MUL

In [15]:
pd.read_sql("DESCRIBE host_info", mydb)

| | Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|---|
| 0 | host_id | b'int' | NO | PRI | | |
| 1 | host_url | b'varchar(50)' | YES | | | |
| 2 | host_name | b'varchar(100)' | YES | | | |
| 3 | host_since | b'date' | YES | | | |
| 4 | host_location | b'varchar(500)' | YES | | | |
| 5 | host_about | b'varchar(5000)' | YES | | | |
| 6 | host_response_time | b'varchar(50)' | YES | | | |
| 7 | host_response_rate | b'varchar(50)' | YES | | | |
| 8 | host_acceptance_rate | b'varchar(50)' | YES | | | |
| 9 | host_is_superhost | b'tinyint(1)' | YES | | | |


If you want to check the data in the `host_info` table, you can use the following query, although for now the table is still empty because we have not inserted anything into it yet.

`SELECT * FROM host_info`

This query means that you want to get all columns (`*`) from the `host_info` table.


If for some reason you want to delete the table, you can run the following SQL query:

*DROP TABLE table_name*

### Insert Data to Host Table

Let's start inserting data into MySQL server. The generic formula to insert data into the table is as follows:

*INSERT INTO table_name (column_name) VALUES (value_for_each_column)*

For example, if you have a table named **customer** that has `reviewer_id` and `reviewer_name` columns, you can insert a row with the following query:

*INSERT INTO customer (reviewer_id, reviewer_name) VALUES (123, 'John Doe')*

So we need to supply the column names and the respective value for each column. Let's start by combining the column names into a single string.

In [16]:
# Get column name
column_name = df_host.columns.to_list()

# Convert column name to single string 
column_name = ",".join(x for x in column_name)

column_name

'host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms'

Next, we can prepare the value into a single string for each row. For example, the following is the input for the first row of the host data.

In [17]:
", ".join(str(x) for x in df_host.iloc[0])

'266763, https://www.airbnb.com/users/show/266763, Francesca, 2010-10-20, Singapore, I am a private tutor by profession. My husband and I are simple easy going folks.\r\nWe have well mannered cats.Welcome to our home. I am sure you will find our home pleasant, clean and comfortable.\r\nPerhaps if you tell us a little about yourself, we can try to make your stay more enjoyable and comfortable.Once again, welcome!!!, within an hour, 100%, nan, False, https://a0.muscache.com/im/pictures/user/5c755ad6-c678-4e7f-bd6c-0ca8a8dec274.jpg?aki_policy=profile_small, https://a0.muscache.com/im/pictures/user/5c755ad6-c678-4e7f-bd6c-0ca8a8dec274.jpg?aki_policy=profile_x_medium, Woodlands, 2.0, 2.0, email, phone, facebook, reviews, jumio, offline_government_id, selfie, government_id, identity_manual, True, True, 2, 0, 2, 0'

Just like in a CSV file, the values for the columns are separated by commas. For example, the value for the first column, `host_id`, is 266763, the value for the second column is the `host_url` string, and so on. However, you may notice that in the middle of the string there are extra commas from `host_verifications` (email, phone, facebook, ...), which should all go into a single column. We need to clean the strings so that SQL reads the data properly.

To get clean values, we need to process the string columns first with the following rules:

- A string should start and end with a quotation mark ("), e.g. "email, phone, facebook", "Francesca"
- Missing values (NaN) should not be quoted and should be transformed to NULL for SQL
- Logical values should not be quoted

To quickly clean our data, we will define a function that will help us.

In [18]:
def clean_char(data):
    
    # Replace None with numpy.nan
    data.fillna(value=np.nan, inplace = True)
    
    for col in data.select_dtypes('object').columns:
        
        # Get index with no missing values
        non_na = data[col][ data[col].isna() == False ].index
        
        # Change quotation mark with apostrophe
        data.loc[non_na, col] = list(map(lambda x: re.sub('"', "'", str(x)), data.loc[non_na, col] ))
        
        # Add quotation mark at the start and end of string
        data.loc[non_na, col] = list(map(lambda x: '"' + x + '"', data.loc[non_na, col]))
               
        # Remove quotation mark from logical value (True and False)
        data.loc[non_na, col] = list(map(lambda x: re.sub('"True"', 'True', x), data.loc[non_na, col]))
        data.loc[non_na, col] = list(map(lambda x: re.sub('"False"', 'False', x), data.loc[non_na, col]))
        
    return(data)

Let's apply the function to our host data.

In [19]:
df_host = clean_char(df_host)

In [20]:
df_host

Unnamed: 0,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,...,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,266763,"""https://www.airbnb.com/users/show/266763""","""Francesca""","""2010-10-20""","""Singapore""","""I am a private tutor by profession. My husban...","""within an hour""","""100%""",,False,...,"""Woodlands""",2.0,2.0,"""email, phone, facebook, reviews, jumio, offli...",True,True,2,0,2,0
1,227796,"""https://www.airbnb.com/users/show/227796""","""Sujatha""","""2010-09-08""","""Singapore, Singapore""","""I am a working professional, living in Singap...",,,,False,...,"""Bukit Timah""",1.0,1.0,"""email, phone, facebook, reviews, offline_gove...",True,True,1,0,1,0
3,367042,"""https://www.airbnb.com/users/show/367042""","""Belinda""","""2011-01-29""","""Singapore""","""Hi My name is Belinda -Housekeeper \r\n\r\nI ...","""within a few hours""","""100%""",,False,...,"""Tampines""",8.0,8.0,"""email, phone, reviews, manual_offline, work_e...",True,True,5,0,5,0
7,1439258,"""https://www.airbnb.com/users/show/1439258""","""Joyce""","""2011-11-24""","""Singapore""","""K2 Guesthouse is designed for guests who want...","""within a few hours""","""95%""","""100%""",False,...,"""Bukit Merah""",49.0,49.0,"""email, phone, reviews, jumio, offline_governm...",True,True,47,0,47,0
9,1521514,"""https://www.airbnb.com/users/show/1521514""","""Elizabeth""","""2011-12-20""","""Singapore""","""Hello !\r\nI am Elizabeth from Singapore.\r\n...","""within a few hours""","""100%""",,False,...,"""Central Area""",6.0,6.0,"""email, phone, facebook, reviews, offline_gove...",True,True,7,1,6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4374,384032484,"""https://www.airbnb.com/users/show/384032484""","""Glariant""","""2021-01-11""","""SG""",,"""within an hour""","""100%""",,False,...,"""Kallang""",0.0,0.0,"""email, phone""",True,False,1,0,1,0
4375,61893984,"""https://www.airbnb.com/users/show/61893984""","""Derrick""","""2016-03-07""","""Singapore""",,,,,False,...,"""Jurong East""",0.0,0.0,"""phone, facebook, offline_government_id, gover...",True,False,1,0,1,0
4377,393078492,"""https://www.airbnb.com/users/show/393078492""","""Cc""","""2021-03-18""","""Singapore""",,,,"""100%""",False,...,"""Kallang""",1.0,1.0,"""phone""",True,True,1,1,0,0
4383,393244617,"""https://www.airbnb.com/users/show/393244617""","""Wesley""","""2021-03-19""","""Singapore""","""Entrepreneur, Song Writer. Love travel, Boxin...",,,,False,...,"""Yau Ma Tei""",1.0,1.0,"""email, phone, work_email""",True,True,1,1,0,0


Let's check the input value for the second observation.

In [21]:
# Convert value of each row into a single string
value_data = ", ".join(str(x) for x in df_host.iloc[1])

value_data

'227796, "https://www.airbnb.com/users/show/227796", "Sujatha", "2010-09-08", "Singapore, Singapore", "I am a working professional, living in Singapore with my husband & 2 daughters.  \r\n\r\n", nan, nan, nan, False, "https://a0.muscache.com/im/pictures/user/8fd2cddb-2795-40b8-8231-d3a34dc3a1e4.jpg?aki_policy=profile_small", "https://a0.muscache.com/im/pictures/user/8fd2cddb-2795-40b8-8231-d3a34dc3a1e4.jpg?aki_policy=profile_x_medium", "Bukit Timah", 1.0, 1.0, "email, phone, facebook, reviews, offline_government_id, selfie, government_id, work_email", True, True, 1, 0, 1, 0'

Now you can see that even if a column contains internal commas, such as `host_verifications`, they are no longer a problem: because the string is wrapped in quotation marks, SQL understands that those commas do not mark the start of the next column.

Finally, we need to transform the missing value `nan` into an explicit NULL, since SQL only considers NULL a legitimate missing value.

In [22]:
# Replace missing value with explicit NULL
value_data = re.sub('\\bnan\\b', 'NULL', value_data)

value_data

'227796, "https://www.airbnb.com/users/show/227796", "Sujatha", "2010-09-08", "Singapore, Singapore", "I am a working professional, living in Singapore with my husband & 2 daughters.  \r\n\r\n", NULL, NULL, NULL, False, "https://a0.muscache.com/im/pictures/user/8fd2cddb-2795-40b8-8231-d3a34dc3a1e4.jpg?aki_policy=profile_small", "https://a0.muscache.com/im/pictures/user/8fd2cddb-2795-40b8-8231-d3a34dc3a1e4.jpg?aki_policy=profile_x_medium", "Bukit Timah", 1.0, 1.0, "email, phone, facebook, reviews, offline_government_id, selfie, government_id, work_email", True, True, 1, 0, 1, 0'
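
Note that the pattern `\bnan\b` is applied to the whole string, so it would also replace the word "nan" if it appeared inside a quoted text value. A possible way around this, sketched below, is to decide value by value before joining. The helper `to_sql_literal` and the sample row are made up for illustration.

```python
import pandas as pd

def to_sql_literal(value):
    """Render one already-cleaned cell as text for a VALUES list."""
    return "NULL" if pd.isna(value) else str(value)

# Made-up row: one real missing value and one text that contains the word "nan"
row = [227796, '"Sujatha"', float("nan"), '"My nan is visiting"', False]

row_text = ", ".join(to_sql_literal(x) for x in row)
print(row_text)
```

Only the true missing value becomes NULL here, while the quoted text is left untouched.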

The full query for this row is as follows. This is the command that stores the observation in MySQL.

In [23]:
"INSERT INTO host_info (" + column_name + ") VALUES (" + value_data + ")"

'INSERT INTO host_info (host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms) VALUES (227796, "https://www.airbnb.com/users/show/227796", "Sujatha", "2010-09-08", "Singapore, Singapore", "I am a working professional, living in Singapore with my husband & 2 daughters.  \r\n\r\n", NULL, NULL, NULL, False, "https://a0.muscache.com/im/pictures/user/8fd2cddb-2795-40b8-8231-d3a34dc3a1e4.jpg?aki_policy=profile_small", "https://a0.muscache.com/im/pictures/user/8fd2cddb-2795-40b8-8231-d3a34dc3a1e4.jpg?aki_policy=profile_x_medium", "Bukit Timah", 1.0, 1.0, "email, phone, facebook, reviews

Now that you understand how to create a query for inserting a single row of data, we will automate the process by creating a function. The following is the full code to insert multiple rows of data into MySQL at once. You just need to pass in the pandas dataframe and the name of the target table on the MySQL server.

To insert multiple rows of data, the SQL command template is as follows:

*INSERT INTO table_name (column_1, column_2, ...)\
VALUES\
(value_row_1_column_1, value_row_1_column_2, ...),\
(value_row_2_column_1, value_row_2_column_2, ...),\
(value_row_3_column_1, value_row_3_column_2, ...)*

In [24]:
def insert_to_sql(data, table_name):
    
    # Start time counter
    start = time.time()
    
    # Convert column name to single string
    column_name = data.columns.to_list()
    column_name = ",".join(x for x in column_name)
    
    # Create initial query
    query = "INSERT INTO " + table_name + " (" + column_name + ") VALUES" 
    
    # Preparing data
    
    print("Preparing data")
    
    # Create empty list to store input for each row
    value_list = []
    
    for i in range(data.shape[0]):
        
        # Join all values in a single row into a single string
        value_data = ", ".join(str(x) for x in data.iloc[i])
    
        # Add bracket at start and end of a single row insertion
        value_data = "(" + value_data + ")"
    
        # Add the string into list of input
        value_list.append(value_data)

    # Join all string into a single giant string
    join_value = ", ".join( x for x in value_list)

    # Join initial query with the data value
    query = " ".join( [query, join_value])

    # Replace missing value with explicit NULL
    query = re.sub("\\bnan\\b", "NULL", query)   
    
    print("Inserting data to MySQL")

    # Execute query to MySQL
    cursor = mydb.cursor()
    cursor.execute(query)
    mydb.commit()
    cursor.close()
    
    # Calculate run time 
    end = time.time()
    print('Processing time: %.4f seconds' %(end - start) )
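
Building the query by string concatenation works for this dataset, but database drivers also support parameterized queries, where the driver handles quoting and missing values for you. The sketch below shows the idea with `executemany`. It uses `sqlite3` as a stand-in so it can run without a server; `insert_rows` and `demo_host` are hypothetical names. `sqlite3` uses `?` as the placeholder, whereas `mysql.connector` uses `%s`, which is why the placeholder is a parameter.

```python
import sqlite3

def insert_rows(connection, table_name, columns, rows, placeholder="?"):
    """Insert rows with a parameterized query instead of string building."""
    column_sql = ", ".join(columns)
    marks = ", ".join([placeholder] * len(columns))
    query = "INSERT INTO %s (%s) VALUES (%s)" % (table_name, column_sql, marks)

    cursor = connection.cursor()
    try:
        cursor.executemany(query, rows)
        connection.commit()
    finally:
        cursor.close()

# Stand-in database with a small table
demo_db = sqlite3.connect(":memory:")
demo_db.execute(
    "CREATE TABLE demo_host (host_id INT, host_name VARCHAR(100), host_about VARCHAR(500))"
)

# Quotes and commas need no manual cleaning; None is stored as NULL
rows = [
    (1, "Ann", 'Says "hello", likes commas'),
    (2, "Budi", None),
]
insert_rows(demo_db, "demo_host", ["host_id", "host_name", "host_about"], rows)

stored = demo_db.execute("SELECT * FROM demo_host").fetchall()
print(stored)
```

To use this pattern with a dataframe, the rows would need to be plain Python values, with `None` in place of missing entries.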

Let's insert the host data into the MySQL `host_info` table by calling `insert_to_sql(df_host, 'host_info')`.

If for some reason you want to delete all the records in your MySQL table, you can run the following SQL query:

*DELETE FROM table_name*

### Query the Host Table

Now we will try running some queries against the `host_info` table.

Let's get the first 5 observations from the `host_info` table. The **LIMIT** clause works like **head()** in pandas and returns the first n rows.

In [25]:
query = "SELECT * FROM host_info LIMIT 5"

pd.read_sql(query, mydb)

Unnamed: 0,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,...,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,23666,https://www.airbnb.com/users/show/23666,Maryanne,2009-06-29,Singapore,I'm a true blue Singaporean who knows my city ...,within an hour,100%,100%,1,...,Marine Parade,1,1,"email, phone, reviews, offline_government_id, ...",1,1,1,1,0,0
1,227796,https://www.airbnb.com/users/show/227796,Sujatha,2010-09-08,"Singapore, Singapore","I am a working professional, living in Singapo...",,,,0,...,Bukit Timah,1,1,"email, phone, facebook, reviews, offline_gover...",1,1,1,0,1,0
2,244567,https://www.airbnb.com/users/show/244567,Sherry,2010-09-24,"San Francisco, California, United States","Just moved to Singapore! Love to travel, see f...",,,,0,...,,1,1,"email, phone, reviews, kba, work_email",1,1,1,1,0,0
3,266763,https://www.airbnb.com/users/show/266763,Francesca,2010-10-20,Singapore,I am a private tutor by profession. My husband...,within an hour,100%,,0,...,Woodlands,2,2,"email, phone, facebook, reviews, jumio, offlin...",1,1,2,0,2,0
4,343908,https://www.airbnb.com/users/show/343908,Matthew,2011-01-11,Singapore,Hey there! My name is Matthew and I've been li...,,,,0,...,Chinatown,2,2,"email, phone, facebook, reviews, work_email",1,0,1,0,1,0


You can also filter the data by a condition. For example, let's get the first 5 hosts who are superhosts. To specify a condition, use the **WHERE** clause.

In [26]:
query = "SELECT * FROM host_info WHERE host_is_superhost = TRUE LIMIT 5"

pd.read_sql(query, mydb)

Unnamed: 0,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,...,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,23666,https://www.airbnb.com/users/show/23666,Maryanne,2009-06-29,Singapore,I'm a true blue Singaporean who knows my city ...,within an hour,100%,100%,1,...,Marine Parade,1,1,"email, phone, reviews, offline_government_id, ...",1,1,1,1,0,0
1,772728,https://www.airbnb.com/users/show/772728,Sunrise,2011-07-03,"Singapore, Singapore",,within an hour,100%,,1,...,Geylang,1,1,"email, phone, reviews, jumio, offline_governme...",1,1,2,2,0,0
2,1030128,https://www.airbnb.com/users/show/1030128,Lena,2011-08-28,Singapore,I enjoy travelling as well as welcoming guests...,within a few hours,100%,100%,1,...,Jurong West,3,3,"email, phone, facebook, reviews, jumio, offlin...",1,0,4,0,4,0
3,1170341,https://www.airbnb.com/users/show/1170341,Jenny,2011-09-17,Singapore,I am from Singapore. Housewife. Love to cook...,within an hour,100%,,1,...,,3,3,"email, phone, reviews, jumio, offline_governme...",1,1,3,0,3,0
4,1584407,https://www.airbnb.com/users/show/1584407,Richard,2012-01-09,"Singapore, Singapore","Enjoyed saving, refurbishing and re-developin...",,,50%,1,...,Marine Parade,5,5,"email, phone, facebook, reviews, manual_offlin...",1,1,4,2,2,0
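When the filter value comes from a variable, it is usually safer to bind it as a parameter than to paste it into the query string. The following is only a minimal sketch: it uses Python's built-in `sqlite3` as a stand-in so it can run without a server, and the rows are made up. With a MySQL driver the placeholder is typically `%s` instead of `?`.

```python
import sqlite3

# Stand-in database: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host_info (host_id INTEGER PRIMARY KEY, host_name TEXT, host_is_superhost BOOLEAN)"
)
conn.executemany(
    "INSERT INTO host_info VALUES (?, ?, ?)",
    [(1, "Ann", 1), (2, "Ben", 0), (3, "Cho", 1), (4, "Dee", None)],
)

# The condition value is bound separately from the SQL text.
query = (
    "SELECT host_id, host_name FROM host_info "
    "WHERE host_is_superhost = ? ORDER BY host_id LIMIT 5"
)
superhosts = conn.execute(query, (1,)).fetchall()
print(superhosts)
conn.close()
```

Note that the host with a missing superhost flag is not returned, because a comparison with a missing value is never true.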


You can get the number of rows in the table using `COUNT(*)`.

In [27]:
# Total number of rows in host_info
query = "SELECT COUNT(*) FROM host_info"

pd.read_sql(query, mydb)

Unnamed: 0,COUNT(*)
0,1205


## Listing
### Processing Listing Table

We will do a full analysis of the data and look at different kinds of SQL commands in another notebook. For now, we will finish inserting all of the data into the database. Let's continue by processing the remaining columns from the listing dataset.

In [28]:
# Remove columns related to the host, except host_id
df_new_listing = df_listing.drop( df_host.columns[1:], axis = 1).copy()

df_new_listing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4388 entries, 0 to 4387
Data columns (total 53 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            4388 non-null   int64  
 1   listing_url                   4388 non-null   object 
 2   scrape_id                     4388 non-null   int64  
 3   last_scraped                  4388 non-null   object 
 4   name                          4388 non-null   object 
 5   description                   4241 non-null   object 
 6   neighborhood_overview         2787 non-null   object 
 7   picture_url                   4388 non-null   object 
 8   host_id                       4388 non-null   int64  
 9   neighbourhood                 2787 non-null   object 
 10  neighbourhood_cleansed        4388 non-null   object 
 11  neighbourhood_group_cleansed  4388 non-null   object 
 12  latitude                      4388 non-null   float64
 13  lon

Some columns, such as `calendar_updated`, `license`, and `bathrooms`, consist only of missing values. The `scrape_id` column also has no apparent use for now, since we already have the `last_scraped` column to show us the latest scraping date. These columns are not informative, so we will drop them.

In [29]:
# Delete columns with all missing values and scrape_id
df_new_listing.drop(['license', 'calendar_updated', 'bathrooms', 'scrape_id'], axis = 1, inplace = True)

Next, we will convert the columns that should hold boolean/logical values, namely `has_availability` and `instant_bookable`.

In [30]:
for col in ['has_availability', 'instant_bookable']:
    
    df_new_listing[col] = list(map(lambda x: True if x == "t" else False if x == "f" else np.nan, df_new_listing[col]))
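The nested conditional in the lambda is effectively a three-way lookup. Here is a minimal sketch of the same idea with a dictionary, shown on a plain list rather than the real column. `FLAG_MAP` and `to_bool` are hypothetical names, and this version returns `None` rather than `np.nan` for anything other than "t" or "f".

```python
# Hypothetical helper: map the "t"/"f" flags to booleans.
FLAG_MAP = {"t": True, "f": False}

def to_bool(value):
    """Return True for "t", False for "f", and None for anything else."""
    return FLAG_MAP.get(value)

sample = ["t", "f", None, "t"]
converted = [to_bool(v) for v in sample]
print(converted)
```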

We will continue by transforming the `price` column into decimal/numeric values by removing the dollar sign and the thousands separator from the string.

In [31]:
df_new_listing['price'] = list(map(lambda x: float(re.sub('[$,]', '', x)), df_new_listing['price']))

df_new_listing['price'][0:5]

0     81.0
1     80.0
2     67.0
3    177.0
4     81.0
Name: price, dtype: float64
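The conversion above expects every value to be a string. If a price were missing, `re.sub` would raise an error, so a slightly more defensive variant can be handy. This is only a sketch, and `parse_price` is a hypothetical helper name.

```python
import re

def parse_price(text):
    """Turn a string such as "$1,234.50" into a float; return None for non-strings."""
    if not isinstance(text, str):
        return None  # covers None and float NaN
    return float(re.sub(r"[$,]", "", text))

print(parse_price("$81.00"))
print(parse_price("$10,286.00"))
print(parse_price(None))
```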

Now we will transform the `amenities` column with the same treatment as the `host_verifications` column from the host dataset.

In [32]:
df_new_listing['amenities'] = list(map(lambda x: re.sub(r"[\"'\[\]]", '', x), df_new_listing['amenities']))

df_new_listing['amenities'][0:5]

0    Wifi, Elevator, Long term stays allowed, Air c...
1    Wifi, Kitchen, Elevator, Long term stays allow...
2    Wifi, Kitchen, Elevator, Air conditioning, TV,...
3    Hair dryer, Kitchen, Free street parking, Keyp...
4    Coffee maker, Hair dryer, Kitchen, Free street...
Name: amenities, dtype: object
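To see what the substitution does, here it is on a single value. The raw format below (a JSON-style list inside a string) is an assumption about the source file. Under that assumption, parsing the list and joining the items gives the same result for this sample.

```python
import json
import re

raw = '["Wifi", "Elevator", "Long term stays allowed"]'  # assumed raw format

# Same substitution as above: drop quotes and square brackets.
by_regex = re.sub(r"[\"'\[\]]", "", raw)

# Alternative: parse the list, then join the items.
by_json = ", ".join(json.loads(raw))

print(by_regex)
print(by_json)
```

One difference to keep in mind: the regular expression also strips apostrophes inside an item, while the parsing approach keeps them.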

We will clean the string columns with the `clean_char()` function from the previous step.

In [33]:
df_new_listing = clean_char(df_new_listing)

### Create Listing Table

Now we will start creating a listing table in the MySQL server. First, let's check the length of each string column in the data.

In [34]:
print("Maximum Character Length")
check_char(df_new_listing.select_dtypes('object'))

Maximum Character Length
Maximum Character Length
listing_url: 39
last_scraped: 12
name: 75
description: 1002
neighborhood_overview: 1002
picture_url: 128
neighbourhood: 43
neighbourhood_cleansed: 25
neighbourhood_group_cleansed: 19
property_type: 36
room_type: 17
bathrooms_text: 19
amenities: 972
calendar_last_scraped: 12
first_review: 12
last_review: 12


Let's start writing the query for creating the table. For numeric values such as price, we will use the **DECIMAL** data type, while the rest are the same as in the host table. The primary key for the listing table is `id`, which identifies the listing. The foreign key for the listing table is `host_id`, which we can use to join the listing table with the host_info table for analysis.

In [35]:
query = """
CREATE TABLE listing(
id INT,
listing_url VARCHAR(100),
last_scraped DATE,
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
picture_url VARCHAR(500),
host_id INT,
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(10, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped DATE,
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review DATE,
last_review DATE,
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)
"""

Let's run the query and create a listing table.

You can check the listing table with the following query.
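As a rough sketch of this run-then-inspect pattern, the example below executes a shortened `CREATE TABLE` statement through a cursor and then lists the resulting columns. It uses Python's built-in `sqlite3` as a stand-in so it can run without a server. The three-column table is for illustration only.

```python
import sqlite3

# Stand-in for the MySQL connection (`mydb` in this notebook).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Shortened version of the listing table, for illustration only.
create_query = """
CREATE TABLE listing(
id INT,
name VARCHAR(500),
price DECIMAL(10, 5),
PRIMARY KEY(id)
)
"""
cursor.execute(create_query)
conn.commit()

# SQLite has no DESCRIBE; PRAGMA table_info plays a similar role here.
columns = [row[1] for row in cursor.execute("PRAGMA table_info(listing)")]
print(columns)
conn.close()
```

Against the MySQL server, the equivalent steps would presumably be `mydb.cursor()`, `cursor.execute(query)`, and then `pd.read_sql("DESCRIBE listing", mydb)`, in the same way the other tables are checked in this notebook.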

### Insert Data to Listing Table

Now we will insert each row into the table that we have defined previously. You just need to pass the pandas dataframe and the name of the target table in the MySQL server.
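For reference, an insert helper of this kind usually builds one parameterized `INSERT` statement and sends the rows separately. The sketch below is not the function used in this notebook. `build_insert_query` and `to_sql_rows` are hypothetical names, and the `%s` placeholder style is an assumption that matches common MySQL drivers.

```python
import math

def build_insert_query(table, columns, placeholder="%s"):
    """Build a parameterized INSERT statement for the given table and columns."""
    column_list = ", ".join(columns)
    value_list = ", ".join([placeholder] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({value_list})"

def to_sql_rows(records):
    """Replace float NaN with None so the driver can send NULL instead."""
    return [
        tuple(None if isinstance(v, float) and math.isnan(v) else v for v in row)
        for row in records
    ]

query = build_insert_query("listing", ["id", "name", "price"])
rows = to_sql_rows([(1, "Cozy room", 81.0), (2, "Studio", float("nan"))])
print(query)
print(rows)
```

With a live connection, the statement and rows would then go to `cursor.executemany(query, rows)`, followed by a commit.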

### Query the Listing Table

Let's run some simple queries to check the data that we have inserted.

Let's check how many rows there are in the listing table.

In [36]:
pd.read_sql("SELECT COUNT(*) FROM listing", mydb)

Unnamed: 0,COUNT(*)
0,4388


Let's look at the top 10 property types by frequency in the listing table. We can use **COUNT** to get the number of rows and **GROUP BY** to count the rows for each property type separately. To sort the result, we can use **ORDER BY** followed by **DESC** to indicate that we want descending order, and finally we take only the first 10 rows with **LIMIT**.

In [37]:
query = """
SELECT COUNT(*) as frequency, property_type
FROM listing
GROUP BY property_type
ORDER BY COUNT(*) DESC
LIMIT 10
"""

pd.read_sql(query, mydb)

Unnamed: 0,frequency,property_type
0,953,Private room in apartment
1,845,Entire condominium
2,545,Entire apartment
3,375,Private room in condominium
4,334,Entire serviced apartment
5,217,Private room in house
6,195,Room in boutique hotel
7,131,Room in hotel
8,124,Room in hostel
9,82,Private room in townhouse
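If the combination of **GROUP BY**, **ORDER BY**, and **LIMIT** feels abstract, it may help to mirror the same steps in plain Python. This is only an illustration on a made-up list, not a replacement for the query.

```python
from collections import Counter

# Made-up property types standing in for the property_type column.
property_types = [
    "Entire apartment", "Private room in apartment", "Entire apartment",
    "Room in hotel", "Private room in apartment", "Entire apartment",
]

# GROUP BY + COUNT(*): tally the rows per property type.
counts = Counter(property_types)

# ORDER BY COUNT(*) DESC + LIMIT 2: keep the two most frequent types.
top_two = counts.most_common(2)
print(top_two)
```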


What about the 10 most expensive listings? We can find them by sorting the data on the `price` column.

In [38]:
query = """
SELECT price, property_type, neighbourhood
FROM listing
ORDER BY price DESC
LIMIT 10
"""

pd.read_sql(query, mydb)

Unnamed: 0,price,property_type,neighbourhood
0,10286.0,Entire condominium,
1,7286.0,Entire apartment,"Singapore, Singapore"
2,7000.0,Private room in apartment,
3,5000.0,Entire house,"Singapore, Singapore"
4,4311.0,Entire serviced apartment,
5,4000.0,Entire apartment,"Singapore, Singapore"
6,4000.0,Private room in apartment,"Singapore, Singapore"
7,3770.0,Shared room in apartment,
8,3300.0,Entire condominium,"Singapore, Singapore"
9,3000.0,Entire place,


## Reviewer

### Processing Reviewer Data

Now that we have finished processing the listing data, we move on to the review dataset. The data consists of the date of the comment, the reviewer, and the content of the review.

In [39]:
df_review = pd.read_csv("data/reviews.csv")

df_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53675 entries, 0 to 53674
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   listing_id     53675 non-null  int64 
 1   id             53675 non-null  int64 
 2   date           53675 non-null  object
 3   reviewer_id    53675 non-null  int64 
 4   reviewer_name  53675 non-null  object
 5   comments       53574 non-null  object
dtypes: int64(3), object(3)
memory usage: 2.5+ MB


Since the review data includes both the reviewer id and the reviewer name, we will split these columns into a separate dataset.

In [40]:
df_reviewer = df_review.loc[:, 'reviewer_id':'reviewer_name']

Let's check whether any reviewer has written more than one review.

In [41]:
df_reviewer.value_counts('reviewer_id')

reviewer_id
3289968      19
350742662    19
97681492     18
163650600    15
10011442     15
             ..
47533960      1
47535495      1
47535590      1
47536393      1
392309572     1
Length: 49185, dtype: int64

Now we will remove duplicated reviewers to reduce the number of rows in the reviewer data.

In [42]:
df_reviewer = df_reviewer[ df_reviewer['reviewer_id'].duplicated() == False ]

print("Number of rows: %i" %df_reviewer.shape[0])

Number of rows: 49185


Finally, since we only have 2 columns, we just need to clean the strings in the reviewer name and the data is ready.

In [43]:
df_reviewer = clean_char(df_reviewer)

In [44]:
df_reviewer.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49185 entries, 0 to 53673
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   reviewer_id    49185 non-null  int64 
 1   reviewer_name  49185 non-null  object
dtypes: int64(1), object(1)
memory usage: 2.1+ MB


### Create Reviewer Table

Now we will start creating a reviewer table in the MySQL server. First, let's check the length of each string column in the data.

In [45]:
print("Maximum Character Length")
check_char(df_reviewer.select_dtypes('object'))

Maximum Character Length
Maximum Character Length
reviewer_name: 41


Let's write the query for creating the table.

In [46]:
query = """
CREATE TABLE reviewer (
reviewer_id INT,
reviewer_name VARCHAR(50),
PRIMARY KEY (reviewer_id)
)
"""

Finally, execute the query to create the reviewer table.

Let's check the reviewer table that we have created.

In [47]:
query = "DESCRIBE reviewer"

pd.read_sql(query, mydb)

Unnamed: 0,Field,Type,Null,Key,Default,Extra
0,reviewer_id,b'int',NO,PRI,,
1,reviewer_name,b'varchar(50)',YES,,,


### Insert Data to Reviewer Table

Now we will insert each row into the table.

### Query the Reviewer Table

Let's check the number of unique reviewers in the table.

In [48]:
pd.read_sql("SELECT COUNT(*) FROM reviewer", mydb)

Unnamed: 0,COUNT(*)
0,49185


Let's see whether different reviewers share an identical name. It is natural for certain names, especially popular ones, to show up multiple times even though they belong to different people (with different `reviewer_id` values).

In [49]:
query = """
SELECT reviewer_name, COUNT(*) as frequency 
FROM reviewer
GROUP BY reviewer_name
ORDER BY COUNT(*) DESC
LIMIT 10
"""

pd.read_sql(query, mydb)

Unnamed: 0,reviewer_name,frequency
0,David,267
1,Michael,242
2,Daniel,192
3,John,185
4,Andrew,166
5,Alex,157
6,Paul,135
7,Kevin,124
8,James,122
9,Peter,122


## Review

### Processing Review Data

For the review data, we just need to drop the `reviewer_name` column and clean the string columns.

In [50]:
df_new_review = df_review.drop('reviewer_name', axis = 1).copy()
df_new_review = clean_char(df_new_review)

df_new_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53675 entries, 0 to 53674
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   53675 non-null  int64 
 1   id           53675 non-null  int64 
 2   date         53675 non-null  object
 3   reviewer_id  53675 non-null  int64 
 4   comments     53574 non-null  object
dtypes: int64(3), object(2)
memory usage: 2.0+ MB


### Create Review Table

Let's continue by creating the review table.

In [51]:
print("Maximum Character Length")
check_char(df_new_review.select_dtypes('object'))

Maximum Character Length
Maximum Character Length
date: 12
comments: 5474


The review table has 2 foreign keys: one that connects to the reviewer and another that connects to the listing that they commented on.

In [52]:
query = """
CREATE TABLE review (
listing_id INT,
id INT,
date DATE,
reviewer_id INT,
comments VARCHAR(6000),
PRIMARY KEY (id),
FOREIGN KEY (reviewer_id) REFERENCES reviewer(reviewer_id),
FOREIGN KEY (listing_id) REFERENCES listing(id)
)
"""

In [53]:
pd.read_sql("DESCRIBE review", mydb)

Unnamed: 0,Field,Type,Null,Key,Default,Extra
0,listing_id,b'int',YES,MUL,,
1,id,b'int',NO,PRI,,
2,date,b'date',YES,,,
3,reviewer_id,b'int',YES,MUL,,
4,comments,b'varchar(6000)',YES,,,
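A practical consequence of the foreign keys is that a review can only be inserted once its reviewer (and listing) already exists in the parent table. The sketch below demonstrates the idea with Python's built-in `sqlite3` as a stand-in and made-up rows. The `PRAGMA` line is specific to SQLite, and the tables are shortened for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.execute(
    "CREATE TABLE reviewer (reviewer_id INT PRIMARY KEY, reviewer_name VARCHAR(50))"
)
conn.execute(
    """
    CREATE TABLE review (
    id INT PRIMARY KEY,
    reviewer_id INT,
    comments VARCHAR(6000),
    FOREIGN KEY (reviewer_id) REFERENCES reviewer(reviewer_id)
    )
    """
)
conn.execute("INSERT INTO reviewer VALUES (1, 'David')")
conn.execute("INSERT INTO review VALUES (10, 1, 'Great stay')")  # parent row exists

try:
    # No reviewer with id 999 exists, so this row violates the foreign key.
    conn.execute("INSERT INTO review VALUES (11, 999, 'Orphan row')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
conn.close()
```

This is why the reviewer and listing tables are filled before the review table.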


### Insert Data to Review Table

Let's insert the review data into the table in MySQL.

### Query the Review Table

Let's check the number of reviews that have a missing value in the `comments` column (no review text). **IS NULL** filters for rows where the value is missing. If you want the rows where the value is not missing, you can use **IS NOT NULL**.

In [55]:
query = """
SELECT COUNT(*) as frequency 
FROM review 
WHERE comments IS NULL
"""

pd.read_sql(query, mydb)

Unnamed: 0,frequency
0,101
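A common mistake is to write `= NULL` instead of **IS NULL**. A comparison with a missing value is never true, so such a filter matches nothing. Here is a small sketch with made-up rows, again using Python's built-in `sqlite3` as a stand-in for the MySQL server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (id INT PRIMARY KEY, comments VARCHAR(6000))")
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [(1, "Lovely place"), (2, None), (3, None)],
)

def count_where(condition):
    """Count review rows matching a fixed, trusted WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM review WHERE {condition}").fetchone()[0]

missing = count_where("comments IS NULL")
present = count_where("comments IS NOT NULL")
equals_null = count_where("comments = NULL")  # never true, so nothing matches
print(missing, present, equals_null)
conn.close()
```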


Let's look at some of the reviews that have no comment.

In [56]:
pd.read_sql("SELECT * FROM review WHERE comments IS NULL LIMIT 5", mydb)

Unnamed: 0,listing_id,id,date,reviewer_id,comments
0,13088479,81444067,2016-06-23,18142579,
1,982909,154334511,2017-05-24,9913702,
2,12484261,167188622,2017-07-06,1834042,
3,19615310,194418577,2017-09-17,112701064,
4,18395154,195827922,2017-09-21,62870530,


## Calendar

### Processing Calendar Data

Finally, we come to the last and largest dataset that we have. It contains the daily availability, the room price, and the minimum and maximum number of nights allowed per stay.

In [57]:
df_calendar = pd.read_csv('data/calendar.csv')

df_calendar.head()

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,4576362,2021-03-29,t,$90.00,$90.00,18.0,1125.0
1,1941719,2021-03-26,t,$119.00,$119.00,,
2,1941719,2021-03-27,f,$119.00,$119.00,4.0,1123.0
3,1941719,2021-03-28,f,$119.00,$119.00,4.0,1123.0
4,1941719,2021-03-29,f,$119.00,$119.00,4.0,1123.0


Let's change the `price` and `adjusted_price` columns to proper numeric columns.

In [58]:
for col in ['price', 'adjusted_price']:
    
    df_calendar[col] = list(map(lambda x: float(re.sub('[$,]', '', x)), df_calendar[col]))

Let's transform the `available` column into a proper logical column and clean all string columns.

In [59]:
df_calendar['available'] = list(map(lambda x: True if x == "t" else False if x == "f" else None, df_calendar['available']))

df_calendar = clean_char(df_calendar)

Let's check the data type.

In [60]:
df_calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1601498 entries, 0 to 1601497
Data columns (total 7 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   listing_id      1601498 non-null  int64  
 1   date            1601498 non-null  object 
 2   available       1601498 non-null  bool   
 3   price           1601498 non-null  float64
 4   adjusted_price  1601498 non-null  float64
 5   minimum_nights  1601255 non-null  float64
 6   maximum_nights  1601255 non-null  float64
dtypes: bool(1), float64(4), int64(1), object(1)
memory usage: 74.8+ MB


### Create Calendar Table

Let's create the table in the MySQL server. The calendar table has no primary key (you can create one if you want) and has a foreign key that connects it to the listing table.

In [61]:
query = """
CREATE TABLE calendar (
listing_id INT,
date DATE,
available BOOLEAN,
price DECIMAL(16,5),
adjusted_price DECIMAL(16,5),
minimum_nights INT,
maximum_nights INT,
FOREIGN KEY (listing_id) REFERENCES listing(id)
)
"""

Let's check the description in our MySQL server.

In [62]:
pd.read_sql("DESCRIBE calendar", mydb)

Unnamed: 0,Field,Type,Null,Key,Default,Extra
0,listing_id,b'int',YES,MUL,,
1,date,b'date',YES,,,
2,available,b'tinyint(1)',YES,,,
3,price,"b'decimal(16,5)'",YES,,,
4,adjusted_price,"b'decimal(16,5)'",YES,,,
5,minimum_nights,b'int',YES,,,
6,maximum_nights,b'int',YES,,,


### Insert Data to Calendar Table

Let's see how many rows we have in this dataset.

In [63]:
print("Number of rows: {:,}".format(df_calendar.shape[0]))

Number of rows: 1,601,498


As you can see, our calendar data has more than 1 million rows. It may be risky to insert that many rows in a single statement, so we will split the data into several batches. For example, here I generate 20 evenly spaced boundary points, which split the data into 19 batches.

In [64]:
data_split = np.linspace(0, df_calendar.shape[0], 20).astype('int')

data_split[0:10]

array([     0,  84289, 168578, 252868, 337157, 421446, 505736, 590025,
       674314, 758604])

The first batch contains the rows from position 0 up to, but not including, position 84289.

In [65]:
df_calendar.iloc[ data_split[0]:data_split[(0+1)] ]

Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,4576362,"""2021-03-29""",True,90.0,90.0,18.0,1125.0
1,1941719,"""2021-03-26""",True,119.0,119.0,,
2,1941719,"""2021-03-27""",False,119.0,119.0,4.0,1123.0
3,1941719,"""2021-03-28""",False,119.0,119.0,4.0,1123.0
4,1941719,"""2021-03-29""",False,119.0,119.0,4.0,1123.0
...,...,...,...,...,...,...,...
84284,5436010,"""2021-10-17""",True,60.0,60.0,7.0,180.0
84285,5436010,"""2021-10-18""",True,60.0,60.0,7.0,180.0
84286,5436010,"""2021-10-19""",True,60.0,60.0,7.0,180.0
84287,5436010,"""2021-10-20""",True,60.0,60.0,7.0,180.0


Let's start inserting the data, one batch at a time. This may take several minutes.
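The batching loop can be sketched as follows. This is a simplified illustration and not the exact code used in this notebook. `batch_bounds` and `insert_in_batches` are hypothetical names, the boundaries are computed with integer arithmetic instead of `np.linspace`, and the database call is replaced by a stand-in that only records the batch sizes.

```python
def batch_bounds(n_rows, n_batches):
    """Yield (start, end) positions that split n_rows into n_batches slices."""
    for i in range(n_batches):
        start = i * n_rows // n_batches
        end = (i + 1) * n_rows // n_batches
        yield start, end

def insert_in_batches(rows, n_batches, insert_batch):
    """Call insert_batch once per slice and return the total number of rows sent."""
    total = 0
    for start, end in batch_bounds(len(rows), n_batches):
        batch = rows[start:end]
        insert_batch(batch)  # e.g. an executemany call plus a commit
        total += len(batch)
    return total

# Stand-in for the database call: just record each batch size.
sizes = []
sent = insert_in_batches(list(range(103)), 5, lambda batch: sizes.append(len(batch)))
print(sizes, sent)
```

Committing after each batch means that a failure partway through loses only the current batch rather than everything inserted so far.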

### Query the Calendar Table

Let's run some simple queries to check that the data has been properly inserted into MySQL.

Let's check the number of rows.

In [66]:
pd.read_sql("SELECT COUNT(*) FROM calendar", mydb)

Unnamed: 0,COUNT(*)
0,1601498


Let's check the dates with the most unavailable listings.

In [67]:
query = """
SELECT date, COUNT(*) as frequency
FROM calendar
WHERE available IS FALSE
GROUP BY date
ORDER BY COUNT(*) DESC
LIMIT 15
"""

pd.read_sql(query, mydb)

Unnamed: 0,date,frequency
0,2021-03-27,2430
1,2021-03-26,2176
2,2021-03-28,2052
3,2021-03-29,1973
4,2021-03-30,1882
5,2021-03-31,1820
6,2021-04-01,1676
7,2021-04-02,1654
8,2021-04-03,1572
9,2022-01-01,1494


# Closing

Once you have finished querying your MySQL server, don't forget to close the connection to the database.

In [68]:
# Closing connection to database
mydb.close()
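If an error occurs before the last cell runs, the connection stays open. One way to guard against that is `contextlib.closing`, which calls `close()` even when the code inside the block raises. The sketch below uses Python's built-in `sqlite3` as a stand-in. The same wrapper should work with any connection object that has a `close()` method.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for the MySQL connection here.
with closing(sqlite3.connect(":memory:")) as conn:
    answer = conn.execute("SELECT 1 + 1").fetchone()[0]

# The connection is closed at this point, so further queries fail.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(answer, still_open)
```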

I hope that the notebook has fulfilled the goals that we stated earlier:

- Create a database with MySQL server
- Create multiple tables
- Insert data into SQL tables
- Write queries to collect data from the SQL database