# Introduction
SQL (Structured Query Language) is a widely used language for accessing and manipulating data in relational databases. It supports operations such as filtering, sorting, grouping, aggregating, and joining. A DataFrame, the core data structure of the pandas library in Python, is not a database, but it stores data in a similar tabular format with rows and columns, much like a spreadsheet. As a Python enthusiast who is interested in learning SQL, I decided to translate common SQL commands into Python code that works on DataFrames. This way, I can apply SQL logic to DataFrames in Python and carry out the same data analysis tasks.

Two things matter most when exploring SQL: a database and a cheat sheet. A database is where you store and retrieve data using SQL commands. A cheat sheet is a handy reference that summarizes the most common SQL commands and syntax. You can find a database here and a cheat sheet [here](https://www.datacamp.com/cheat-sheet/sql-basics-cheat-sheet). However, when I tried to download the database mentioned in the cheat sheet link, it was not available, so I had to create one on my own. I used Python to process a relevant dataset into a new database that has the same columns as the one in the cheat sheet. If you are interested in how I did that, please read the article.

In [1]:
# Load the airbnb_listings table from the SQLite database into a DataFrame
import pandas as pd
import sqlite3
conn = sqlite3.connect('Data/airbnb_listings.db')
# read_sql_query already returns a DataFrame, so no extra conversion is needed
df = pd.read_sql_query('SELECT * FROM airbnb_listings', conn)
conn.close()
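
If the database file is not at hand, the same loading step can be tried on a small in-memory SQLite table. This is only a sketch: the three rows and the name `toy_df` are hypothetical stand-ins, not the real dataset.

```python
import sqlite3

import pandas as pd

# Hypothetical toy rows; the real airbnb_listings table is far larger.
rows = [
    ("Paris", 1, 1, 2011, "France"),
    ("Rome", 2, 4, 2015, "Italia"),
    ("Bangkok", 3, 2, 2017, "Thailand"),
]

# ":memory:" creates a temporary database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airbnb_listings "
    "(city TEXT, id INTEGER, number_of_rooms INTEGER, "
    "year_listed INTEGER, country TEXT)"
)
conn.executemany("INSERT INTO airbnb_listings VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# Run the query and get the result back as a DataFrame.
toy_df = pd.read_sql_query("SELECT * FROM airbnb_listings", conn)
conn.close()

print(toy_df)
```

Every query in the sections below can be checked against such a toy table before running it on the full data.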

## Querying tables
### Get all the columns from a table
SELECT * 
FROM airbnb_listings;

In [2]:
df

Unnamed: 0,city,id,number_of_rooms,year_listed,country
0,Paris,281420,1,2011,France
1,Paris,3705183,1,2013,France
2,Paris,4082273,1,2014,France
3,Paris,4797344,1,2013,France
4,Paris,4823489,1,2014,France
...,...,...,...,...,...
500259,Paris,38338635,1,2015,France
500260,Paris,38538692,1,2013,France
500261,Paris,38683356,1,2012,France
500262,Paris,39659000,1,2015,France


### Return the city column from the table
SELECT city 
FROM airbnb_listings;

In [3]:
df['city']

0         Paris
1         Paris
2         Paris
3         Paris
4         Paris
          ...  
500259    Paris
500260    Paris
500261    Paris
500262    Paris
500263    Paris
Name: city, Length: 500264, dtype: object

### Get the city and year_listed columns from the table
SELECT city, year_listed
FROM airbnb_listings;

In [4]:
df[['city', 'year_listed']]

Unnamed: 0,city,year_listed
0,Paris,2011
1,Paris,2013
2,Paris,2014
3,Paris,2013
4,Paris,2014
...,...,...
500259,Paris,2015
500260,Paris,2013
500261,Paris,2012
500262,Paris,2015


### Get the city and year_listed columns, ordered by the number_of_rooms in ascending order
SELECT city, year_listed 
FROM airbnb_listings 
ORDER BY number_of_rooms ASC;

In [5]:
df.sort_values(by=['number_of_rooms'],ascending=True)[['city', 'year_listed']]

Unnamed: 0,city,year_listed
0,Paris,2011
311712,Mexico City,2014
311711,Mexico City,2014
311710,Hong Kong,2012
311709,Bangkok,2017
...,...,...
331017,Istanbul,2019
331018,Istanbul,2019
189863,Bangkok,2020
331027,Istanbul,2019


### Get the city and year_listed columns, ordered by the number_of_rooms in descending order
SELECT city, year_listed 
FROM airbnb_listings 
ORDER BY number_of_rooms DESC;

In [6]:
df.sort_values(by=['number_of_rooms'],ascending=False)[['city', 'year_listed']]

Unnamed: 0,city,year_listed
185388,Paris,2015
80885,Istanbul,2019
185387,Paris,2015
185386,Paris,2015
185385,Paris,2015
...,...,...
177470,Rio de Janeiro,2011
177469,Mexico City,2019
177468,Mexico City,2018
177467,Mexico City,2011


### Get the first 5 rows from airbnb_listings
SELECT * 
FROM airbnb_listings
LIMIT 5;

In [7]:
df.head(5)

Unnamed: 0,city,id,number_of_rooms,year_listed,country
0,Paris,281420,1,2011,France
1,Paris,3705183,1,2013,France
2,Paris,4082273,1,2014,France
3,Paris,4797344,1,2013,France
4,Paris,4823489,1,2014,France


## Filtering on numeric columns
### Get all the listings where number_of_rooms is greater than or equal to 3
SELECT *
FROM airbnb_listings 
WHERE number_of_rooms >= 3;

In [8]:
df[df['number_of_rooms']>= 3]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
6701,Paris,1250253,3,2013,France
6702,Paris,1337659,3,2013,France
6703,Paris,2277264,3,2013,France
6704,Paris,2534901,3,2014,France
6705,Paris,2610464,3,2014,France
...,...,...,...,...,...
245336,Paris,33504430,3,2014,France
245337,Paris,34100085,3,2013,France
245338,Paris,34403561,3,2012,France
245339,Paris,35956764,3,2016,France


### Get all the listings where number_of_rooms is more than 3
SELECT *
FROM airbnb_listings 
WHERE number_of_rooms > 3;

In [9]:
df[df['number_of_rooms']> 3]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
7051,Paris,417214,4,2012,France
7052,Paris,914727,4,2013,France
7053,Paris,2617460,4,2014,France
7054,Paris,2942218,4,2012,France
7055,Paris,3072377,4,2013,France
...,...,...,...,...,...
245054,Paris,35537386,4,2014,France
245055,Paris,35875964,4,2018,France
245056,Paris,37441080,5,2013,France
245057,Paris,40772446,4,2015,France


### Get all the listings where number_of_rooms is exactly 3
SELECT *
FROM airbnb_listings 
WHERE number_of_rooms = 3;

In [10]:
df[df['number_of_rooms']== 3]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
6701,Paris,1250253,3,2013,France
6702,Paris,1337659,3,2013,France
6703,Paris,2277264,3,2013,France
6704,Paris,2534901,3,2014,France
6705,Paris,2610464,3,2014,France
...,...,...,...,...,...
245336,Paris,33504430,3,2014,France
245337,Paris,34100085,3,2013,France
245338,Paris,34403561,3,2012,France
245339,Paris,35956764,3,2016,France


### Filtering columns within a range: get all the listings with 3 to 6 rooms
SELECT *
FROM airbnb_listings 
WHERE number_of_rooms BETWEEN 3 AND 6;

In [11]:
df[(df['number_of_rooms']>= 3) & (df['number_of_rooms']<= 6)]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
6701,Paris,1250253,3,2013,France
6702,Paris,1337659,3,2013,France
6703,Paris,2277264,3,2013,France
6704,Paris,2534901,3,2014,France
6705,Paris,2610464,3,2014,France
...,...,...,...,...,...
245336,Paris,33504430,3,2014,France
245337,Paris,34100085,3,2013,France
245338,Paris,34403561,3,2012,France
245339,Paris,35956764,3,2016,France
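
A closer match to `BETWEEN` is `Series.between`, which includes both endpoints by default, just like the SQL operator. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for airbnb_listings.
toy = pd.DataFrame(
    {
        "city": ["Paris", "Rome", "Bangkok", "Istanbul"],
        "number_of_rooms": [1, 3, 6, 7],
    }
)

# between() is inclusive on both ends by default, like SQL BETWEEN.
in_range = toy[toy["number_of_rooms"].between(3, 6)]
print(in_range)
```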


## Filtering on text columns
### Get all the listings that are based in 'Paris'
SELECT * 
FROM airbnb_listings 
WHERE city = "Paris";

In [12]:
df[df['city']=='Paris']

Unnamed: 0,city,id,number_of_rooms,year_listed,country
0,Paris,281420,1,2011,France
1,Paris,3705183,1,2013,France
2,Paris,4082273,1,2014,France
3,Paris,4797344,1,2013,France
4,Paris,4823489,1,2014,France
...,...,...,...,...,...
250127,Paris,38338635,1,2015,France
250128,Paris,38538692,1,2013,France
250129,Paris,38683356,1,2012,France
250130,Paris,39659000,1,2015,France


### Filter one column on many conditions: get the listings based in the 'USA' or in 'France'
SELECT *
FROM airbnb_listings 
WHERE country IN ("USA", "France");

In [13]:
df[(df['country'] =="USA") | (df['country']=="France")]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
0,Paris,281420,1,2011,France
1,Paris,3705183,1,2013,France
2,Paris,4082273,1,2014,France
3,Paris,4797344,1,2013,France
4,Paris,4823489,1,2014,France
...,...,...,...,...,...
250127,Paris,38338635,1,2015,France
250128,Paris,38538692,1,2013,France
250129,Paris,38683356,1,2012,France
250130,Paris,39659000,1,2015,France
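
Chaining `|` conditions gets long as the list of values grows. `Series.isin` reads more like SQL's `IN`. A minimal sketch, assuming a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy data standing in for airbnb_listings.
toy = pd.DataFrame(
    {
        "city": ["Paris", "New York", "Rome", "Bangkok"],
        "country": ["France", "USA", "Italia", "Thailand"],
    }
)

# isin() keeps the rows whose country appears in the given list.
wanted = toy[toy["country"].isin(["USA", "France"])]
print(wanted)
```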


### Get all listings where city starts with "r" and where it does not end with "o"
SELECT * 
FROM airbnb_listings 
WHERE city LIKE "r%" AND city NOT LIKE "%o";

In [14]:
# In SQLite, LIKE is case-insensitive for ASCII letters by default, so "r%" also matches "R..."
# pandas string comparisons are case-sensitive, so this only matches an uppercase 'R'
df[(df['city'].str[0]=='R') & (df['city'].str[-1]!='o')]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
7131,Rome,5335850,1,2015,Italia
7136,Rome,10669776,1,2016,Italia
7137,Rome,18826436,1,2015,Italia
8939,Rome,33308717,1,2013,Italia
9110,Rome,5380831,4,2015,Italia
...,...,...,...,...,...
241495,Rome,12953708,1,2012,Italia
241496,Rome,24214949,2,2016,Italia
241957,Rome,12667601,3,2015,Italia
241958,Rome,29019775,2,2017,Italia
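
To mimic the case-insensitive matching of SQLite's `LIKE`, one option is to lowercase the column before testing the prefix and suffix. This is a sketch on hypothetical toy data, not the only way to do it:

```python
import pandas as pd

# Hypothetical toy data; note the mixed-case city names.
toy = pd.DataFrame({"city": ["Rome", "Rio de Janeiro", "rome", "Paris"]})

# Lowercase once, then test prefix and suffix on the lowercased values.
city_lower = toy["city"].str.lower()
mask = city_lower.str.startswith("r") & ~city_lower.str.endswith("o")

matches = toy[mask]
print(matches)
```

Here both "Rome" and "rome" are kept, while "Rio de Janeiro" is dropped because it ends with "o".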


## Filtering on multiple columns
### Get all the listings in "Paris" where number_of_rooms is greater than 3
SELECT *
FROM airbnb_listings 
WHERE city = "Paris" AND number_of_rooms > 3;

In [15]:
df[(df['city']=="Paris") & (df['number_of_rooms']>3)]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
7051,Paris,417214,4,2012,France
7052,Paris,914727,4,2013,France
7053,Paris,2617460,4,2014,France
7054,Paris,2942218,4,2012,France
7055,Paris,3072377,4,2013,France
...,...,...,...,...,...
245054,Paris,35537386,4,2014,France
245055,Paris,35875964,4,2018,France
245056,Paris,37441080,5,2013,France
245057,Paris,40772446,4,2015,France
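
For readers who prefer something closer to a `WHERE` clause, `DataFrame.query` accepts the condition as a string. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for airbnb_listings.
toy = pd.DataFrame(
    {
        "city": ["Paris", "Paris", "Rome", "Paris"],
        "number_of_rooms": [1, 4, 5, 3],
    }
)

# The query string uses Python-style operators: ==, and, or, >.
big_paris = toy.query("city == 'Paris' and number_of_rooms > 3")
print(big_paris)
```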


### Get all the listings in "Paris" OR the ones that were listed after 2012
SELECT * 
FROM airbnb_listings
WHERE city = 'Paris' OR year_listed > 2012;

In [16]:
df[(df['city']=="Paris") | (df['year_listed']>2012)]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
0,Paris,281420,1,2011,France
1,Paris,3705183,1,2013,France
2,Paris,4082273,1,2014,France
3,Paris,4797344,1,2013,France
4,Paris,4823489,1,2014,France
...,...,...,...,...,...
250127,Paris,38338635,1,2015,France
250128,Paris,38538692,1,2013,France
250129,Paris,38683356,1,2012,France
250130,Paris,39659000,1,2015,France


## Filtering on missing data
### Get all the listings where number_of_rooms is missing
SELECT *
FROM airbnb_listings 
WHERE number_of_rooms IS NULL; 

In [17]:
df[df['number_of_rooms'].isna()]

Unnamed: 0,city,id,number_of_rooms,year_listed,country


### Get all the listings where number_of_rooms is not missing
SELECT *
FROM airbnb_listings 
WHERE number_of_rooms IS NOT NULL; 

In [18]:
df[(df['number_of_rooms'].notnull())]

Unnamed: 0,city,id,number_of_rooms,year_listed,country
0,Paris,281420,1,2011,France
1,Paris,3705183,1,2013,France
2,Paris,4082273,1,2014,France
3,Paris,4797344,1,2013,France
4,Paris,4823489,1,2014,France
...,...,...,...,...,...
250127,Paris,38338635,1,2015,France
250128,Paris,38538692,1,2013,France
250129,Paris,38683356,1,2012,France
250130,Paris,39659000,1,2015,France
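
Since the dataset here has no missing values, the two filters are easier to see on a toy frame that does contain one. A sketch, assuming hypothetical data; `dropna` with `subset` is another way to express `IS NOT NULL`:

```python
import pandas as pd

# Hypothetical toy data with one missing number_of_rooms value.
toy = pd.DataFrame({"id": [1, 2, 3], "number_of_rooms": [1, None, 3]})

# IS NULL: keep the rows where the value is missing.
missing = toy[toy["number_of_rooms"].isna()]

# IS NOT NULL: drop the rows where the value is missing.
present = toy.dropna(subset=["number_of_rooms"])

print(missing)
print(present)
```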


## Simple aggregations
### Get the total number of rooms available across all listings 
SELECT SUM(number_of_rooms) 
FROM airbnb_listings; 

In [19]:
df['number_of_rooms'].sum()

379070

### Get the average number of rooms per listing across all listings
SELECT AVG(number_of_rooms) 
FROM airbnb_listings;

In [20]:
df['number_of_rooms'].mean()

1.515479826651528

### Get the highest number of rooms across all listings
SELECT MAX(number_of_rooms) 
FROM airbnb_listings;

In [21]:
df['number_of_rooms'].max()

50

### Get the lowest number of rooms across all listings
SELECT MIN(number_of_rooms) 
FROM airbnb_listings;

In [22]:
df['number_of_rooms'].min()

1
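
The four aggregates above can also be computed in one call with `Series.agg`, similar to writing `SELECT SUM(...), AVG(...), MAX(...), MIN(...)` in a single query. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for airbnb_listings.
toy = pd.DataFrame({"number_of_rooms": [1, 2, 3, 4]})

# agg() with a list of function names returns one value per function.
summary = toy["number_of_rooms"].agg(["sum", "mean", "max", "min"])
print(summary)
```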

## Grouping, filtering, and sorting 
### Get the total number of rooms for each country
SELECT country, SUM(number_of_rooms)
FROM airbnb_listings
GROUP BY country;

In [23]:
df.groupby(['country'])['number_of_rooms'].sum()

country
Autralia        52364
Brazil          42426
France          70184
Hong Kong        7665
Italia          39958
Mexico          28886
South Africa    35706
Thailand        23807
Turkey          34123
USA             43951
Name: number_of_rooms, dtype: int64

### Get the average number of rooms for each country
SELECT country, AVG(number_of_rooms)
FROM airbnb_listings
GROUP BY country;

In [24]:
df.groupby(['country'])['number_of_rooms'].mean()

country
Autralia        1.667909
Brazil          1.707627
France          1.369097
Hong Kong       1.309585
Italia          1.492976
Mexico          1.500883
South Africa    2.018314
Thailand        1.382681
Turkey          1.518535
USA             1.316450
Name: number_of_rooms, dtype: float64

### For each country, get the average number of rooms per listing, sorted by ascending order
SELECT country, AVG(number_of_rooms) AS avg_rooms
FROM airbnb_listings
GROUP BY country
ORDER BY avg_rooms ASC;

In [25]:
# Sort the grouped averages in ascending order, mirroring ORDER BY avg_rooms ASC
df.groupby(['country'])['number_of_rooms'].mean().sort_values(ascending=True)


country
Hong Kong       1.309585
USA             1.316450
France          1.369097
Thailand        1.382681
Italia          1.492976
Mexico          1.500883
Turkey          1.518535
Autralia        1.667909
Brazil          1.707627
South Africa    2.018314
Name: number_of_rooms, dtype: float64
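
The `AS avg_rooms` alias from the SQL query can be mirrored with named aggregation, which gives the result column a name that can then be used for sorting. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for airbnb_listings.
toy = pd.DataFrame(
    {
        "country": ["France", "France", "USA", "USA", "Brazil"],
        "number_of_rooms": [1, 2, 1, 1, 3],
    }
)

# Named aggregation: avg_rooms=(source column, function), like AVG(...) AS avg_rooms.
avg_rooms = (
    toy.groupby("country", as_index=False)
    .agg(avg_rooms=("number_of_rooms", "mean"))
    .sort_values("avg_rooms", ascending=True)
)
print(avg_rooms)
```

With `as_index=False`, `country` stays a regular column, so the result has the same two columns as the SQL output.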

### For Thailand and the USA, get the average number of rooms per listing in each country
SELECT country, AVG(number_of_rooms)
FROM airbnb_listings
WHERE country IN ("USA", "Thailand")
GROUP BY country;

In [26]:
df[(df['country']=="USA")|(df['country']=="Thailand")].groupby(['country'])['number_of_rooms'].mean()

country
Thailand    1.382681
USA         1.316450
Name: number_of_rooms, dtype: float64

### Get the number of cities per country, where there are listings
SELECT country, COUNT(city) AS number_of_cities
FROM airbnb_listings
GROUP BY country;

In [27]:
df.groupby(['country'])['city'].count()

country
Autralia        31395
Brazil          24845
France          51263
Hong Kong        5853
Italia          26764
Mexico          19246
South Africa    17691
Thailand        17218
Turkey          22471
USA             33386
Name: city, dtype: int64
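
Note that `COUNT(city)` counts rows with a non-null city, so the numbers above are listings per country rather than distinct cities. If distinct cities are wanted, the SQL would be `COUNT(DISTINCT city)` and the pandas counterpart is `nunique`. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: France has three listings in two cities.
toy = pd.DataFrame(
    {
        "country": ["France", "France", "France", "USA"],
        "city": ["Paris", "Paris", "Lyon", "New York"],
    }
)

# count() mirrors COUNT(city); nunique() mirrors COUNT(DISTINCT city).
listings_per_country = toy.groupby("country")["city"].count()
cities_per_country = toy.groupby("country")["city"].nunique()

print(listings_per_country)
print(cities_per_country)
```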

### Get all the years where there were more than 100 listings per year
SELECT year_listed
FROM airbnb_listings
GROUP BY year_listed
HAVING COUNT(id) > 100;

In [28]:
# Count listings per year, then keep only the years with more than 100 (HAVING)
f = df.groupby(['year_listed'])['id'].count()
f[f > 100].index

Index([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020,
       2021],
      dtype='int64', name='year_listed')
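
The `GROUP BY ... HAVING` pattern is a two-step process: aggregate first, then filter on the aggregate. A minimal sketch on hypothetical toy data, using a threshold of 2 instead of 100 so that the small example has something to filter out:

```python
import pandas as pd

# Hypothetical toy data: three listings in 2019, one in 2020.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "year_listed": [2019, 2019, 2019, 2020],
    }
)

# Step 1 (GROUP BY): count listings per year.
counts = toy.groupby("year_listed")["id"].count()

# Step 2 (HAVING): keep only the years whose count passes the threshold.
busy_years = counts[counts > 2].index.tolist()
print(busy_years)
```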