In [1]:
import numpy as np
import pandas as pd 
import seaborn as sns
import sqlite3 as sql
from pandasql import sqldf
import matplotlib.pyplot as pl

# Summarizing data

- AGGREGATE FUNCTIONS\
AVG()\
SUM()\
MIN()\
MAX()\
COUNT()

- NUMERICAL FIELDS ONLY\
AVG()\
SUM()

- VARIOUS DATA TYPES\
MIN()\
MAX()\
COUNT()
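
As a small sketch of the split above, the snippet below runs each aggregate against a toy table. It assumes Python's built-in `sqlite3` module with an in-memory database instead of the films data; the table name `demo_films` and its three rows are made up for illustration.

```python
import sqlite3

# Hypothetical toy table, not the real films data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_films (title TEXT, duration REAL)")
conn.executemany(
    "INSERT INTO demo_films VALUES (?, ?)",
    [("Metropolis", 145.0), ("Wings", 30.0), ("Avatar", None)],
)

# SUM() and AVG() need numbers; the NULL duration is skipped
numeric = conn.execute(
    "SELECT SUM(duration), AVG(duration) FROM demo_films"
).fetchone()

# MIN(), MAX(), and COUNT() also accept text columns
textual = conn.execute(
    "SELECT MIN(title), MAX(title), COUNT(title) FROM demo_films"
).fetchone()

print(numeric)  # (175.0, 87.5)
print(textual)  # ('Avatar', 'Wings', 3)
```

On text, MIN() and MAX() compare values alphabetically, which is why they return the first and last titles here.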

In [2]:
films = pd.read_csv('films_copy.csv',index_col=0)

In [3]:
films_sql = '''SELECT * 
FROM films'''
sqldf(films_sql,env=None)

Unnamed: 0,id,title,release_year,country,duration,language,certification,gross,budget
0,1,Intolerance: Love's Struggle Throughout the Ages,1916.0,USA,123.0,,Not Rated,,385907.0
1,2,Over the Hill to the Poorhouse,1920.0,USA,110.0,,,3000000.0,100000.0
2,3,The Big Parade,1925.0,USA,151.0,,Not Rated,,245000.0
3,4,Metropolis,1927.0,Germany,145.0,German,Not Rated,26435.0,6000000.0
4,5,Pandora's Box,1929.0,Germany,110.0,German,Not Rated,9950.0,
...,...,...,...,...,...,...,...,...,...
4963,4964,Unforgotten,,UK,45.0,English,,,
4964,4965,Wings,,USA,30.0,English,,,
4965,4966,Wolf Creek,,Australia,,English,,,
4966,4967,Wuthering Heights,,UK,142.0,English,,,


# Practice with aggregate functions
Now let's try extracting summary information from a table using these new aggregate functions. Summarizing is helpful in real life when extracting top-line details from your dataset. Perhaps you'd like to know how old the oldest film in the films table is, what the most expensive film is, or how many films you have listed.\
Now it's your turn to get more insights about the films table!

In [5]:
# Query the sum of film durations
total_duration = '''
SELECT SUM(duration) AS total_duration
FROM films;'''
sqldf(total_duration,env=None)

Unnamed: 0,total_duration
0,534882.0


In [6]:
# Calculate the average duration of all films
average_duration = '''
SELECT AVG(duration) AS average_duration
FROM films;'''
sqldf(average_duration,env=None)

Unnamed: 0,average_duration
0,107.947931


In [7]:
# Find the latest release_year
latest_year = '''SELECT MAX(release_year) AS latest_year
FROM films;'''
sqldf(latest_year,env=None)

Unnamed: 0,latest_year
0,2016.0


In [8]:
# Find the duration of the shortest film
shortest_film = '''
SELECT MIN(duration) AS shortest_film
FROM films;'''
sqldf(shortest_film,env=None)

Unnamed: 0,shortest_film
0,7.0


- Well done! You'll find yourself using aggregate functions over and over again to get a quick grasp of the data in a SQL database.

# Summarizing subsets

# Combining aggregate functions with WHERE
Combining aggregate functions with WHERE gives you a powerful tool for getting more granular with your insights, for example, finding the total budget of movies made from the year 2010 onwards.\
This combination is useful when you only want to summarize a subset of your data. In your film-industry role, for example, you may want to summarize each certification category to compare how they perform, or to check whether one certification has a higher average budget than another.
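
The certification comparison mentioned above could look roughly like this. This is a minimal sketch using the built-in `sqlite3` module; the `demo_films` table, its budgets, and the helper `avg_budget` are hypothetical and only stand in for the real films data.

```python
import sqlite3

# Hypothetical toy table with made-up budgets
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_films (certification TEXT, budget REAL)")
conn.executemany(
    "INSERT INTO demo_films VALUES (?, ?)",
    [("PG", 100.0), ("PG", 300.0), ("R", 50.0), ("R", 150.0), ("R", 250.0)],
)

def avg_budget(certification):
    """Average budget over only the rows that pass the WHERE filter."""
    row = conn.execute(
        "SELECT AVG(budget) FROM demo_films WHERE certification = ?",
        (certification,),
    ).fetchone()
    return row[0]

avg_pg = avg_budget("PG")
avg_r = avg_budget("R")
print(avg_pg, avg_r)  # 200.0 150.0
```

WHERE filters the rows first, and the aggregate then runs only on what is left.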

Let's see what insights you can gain about the financials in the dataset.

In [18]:
# Calculate the sum of gross from the year 2000 or later
# Select only the aggregate: a bare title next to SUM() would come from an arbitrary row
total_gross = '''SELECT SUM(gross) AS total_gross
FROM films
WHERE release_year >= 2000;'''
sqldf(total_gross,env=None)

Unnamed: 0,total_gross
0,150900900000.0


In [13]:
# Calculate the average gross of films that start with A
# Select only the aggregate: a bare title next to AVG() would come from an arbitrary row
avg_gross_A = '''
SELECT AVG(gross) AS avg_gross_A
FROM films
WHERE title LIKE 'A%';'''
sqldf(avg_gross_A,env=None)

Unnamed: 0,avg_gross_A
0,47893240.0


In [17]:
# Find the lowest-grossing film of 1994
# (SQLite fills title and release_year from the row that holds the MIN)
lowest_gross = '''
SELECT title, release_year, MIN(gross) AS lowest_gross
FROM films
WHERE release_year = 1994;'''
sqldf(lowest_gross,env=None)

Unnamed: 0,title,release_year,lowest_gross
0,There Goes My Baby,1994.0,125169.0


In [16]:
# Find the highest-grossing film released between 2000 and 2012
highest_gross = '''
SELECT title, release_year, MAX(gross) AS highest_gross
FROM films
WHERE release_year BETWEEN 2000 AND 2012;'''
sqldf(highest_gross,env=None)

Unnamed: 0,title,release_year,highest_gross
0,Avatar,2009.0,760505847.0


- Nice. SQL provides us with several building blocks that we can combine in all kinds of ways, hence the name: Structured Query Language.

# Using ROUND()
Aggregate functions work great with numerical values; however, these results can sometimes get unwieldy when dealing with long decimal values. Luckily, SQL provides you with the ROUND() function to tame these long decimals.\
If asked to give the average budget of your films, ten decimal places are not necessary. Instead, you can round to two decimal places to create results that make more sense for currency.
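
To illustrate rounding to two decimal places, here is a minimal sketch with the built-in `sqlite3` module. The `demo_films` table and its three budgets are hypothetical values chosen so that the average has a long decimal tail.

```python
import sqlite3

# Hypothetical toy budgets: their average is 18.333...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_films (budget REAL)")
conn.executemany(
    "INSERT INTO demo_films VALUES (?)", [(10.0,), (20.0,), (25.0,)]
)

# Compare the raw average with the same value rounded for currency
raw, rounded = conn.execute(
    "SELECT AVG(budget), ROUND(AVG(budget), 2) FROM demo_films"
).fetchone()

print(raw)      # long decimal tail
print(rounded)  # 18.33
```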

Now you try!

In [19]:
reviews = pd.read_csv('reviews_copy.csv', index_col=0)

In [22]:
r = '''SELECT *
FROM reviews;'''
sqldf(r, env=None)

Unnamed: 0,film_id,num_user,num_critic,imdb_score,num_votes,facebook_likes
0,3405,285.0,267.0,6.4,149998,0
1,478,65.0,29.0,3.2,8465,491
2,74,83.0,25.0,7.6,7071,930
3,1254,1437.0,224.0,8.0,241030,13000
4,740,111.0,64.0,6.4,64742,0
...,...,...,...,...,...,...
4962,4801,2.0,6.0,7.0,75,121
4963,4264,514.0,488.0,7.0,181472,58000
4964,4356,85.0,119.0,6.2,29738,12000
4965,430,118.0,38.0,5.9,29591,0


In [21]:
# Round the average number of facebook_likes to one decimal place
avg_facebook_likes = '''
SELECT ROUND(AVG(facebook_likes), 1) AS avg_facebook_likes
FROM reviews;'''
sqldf(avg_facebook_likes,env=None)

Unnamed: 0,avg_facebook_likes
0,7795.2


- Well done! The average facebook_likes, rounded to one decimal place, is 7795.2. This insight can be used as a benchmark to measure film reviews; any film with over 7795.2 likes can be considered popular.

# ROUND() with a negative parameter
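A useful thing you can do with ROUND() is pass a negative number as the decimal place parameter, which rounds to the left of the decimal point (for example, -2 rounds to the nearest hundred). This can come in handy if your manager only needs to know the average number of facebook_likes to the hundreds, since granularity below one hundred likes won't impact decision making. Note that this works in databases such as PostgreSQL; SQLite, which pandasql uses by default, treats a negative parameter as 0 and rounds to a whole number instead.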

Social media plays a significant role in determining success. If a movie trailer is posted and barely gets any likes, the movie itself may not be successful. Remember how 2020's "Sonic the Hedgehog" movie got a revamp after the public saw the trailer?

Let's apply this to other parts of the dataset and see what the benchmark is for movie budgets so, in the future, it's clear whether the film is above or below budget.

In [25]:
# Calculate the average budget rounded to the thousands
avg_budget_thousands = '''
SELECT ROUND(AVG(budget), -3) AS avg_budget_thousands
FROM films;'''
sqldf(avg_budget_thousands,env=None)

Unnamed: 0,avg_budget_thousands
0,39902826.0
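
The result above still ends in 826 rather than 000 because pandasql runs on SQLite, whose ROUND() takes a negative second argument to be 0. One possible workaround, sketched here with the built-in `sqlite3` module and the average budget reported in this notebook as a literal value, is to divide by 1000, round, and multiply back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
value = 39902826.3  # average budget, as reported to one decimal place in this notebook

# ROUND(x, -3) behaves like ROUND(x, 0) in SQLite;
# dividing, rounding, and multiplying back rounds to the thousands.
negative_digits, workaround = conn.execute(
    "SELECT ROUND(?, -3), ROUND(? / 1000.0) * 1000", (value, value)
).fetchone()

print(negative_digits)  # 39902826.0
print(workaround)       # 39903000.0
```

In a database where negative parameters are supported, such as PostgreSQL, `ROUND(AVG(budget), -3)` alone should be enough.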


In [26]:
# Calculate the average budget rounded to one decimal place
avg_budget_rounded = '''
SELECT ROUND(AVG(budget), 1) AS avg_budget_rounded
FROM films;'''
sqldf(avg_budget_rounded,env=None)

Unnamed: 0,avg_budget_rounded
0,39902826.3


- ROUND() of applause! The ROUND() function is very handy when making financial calculations to get a top-level view or specify to the penny or cent.

# Aliasing and arithmetic

- Arithmetic\
+, -, *, and /

SELECT (4 + 3);

SELECT (4 - 3);

SELECT (4 * 3);

SELECT (4 / 3);
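
The four statements above can be checked in one go. This is a minimal sketch using the built-in `sqlite3` module; the extra `4 / 3.0` column is added here to show why the later queries in this notebook divide by `60.0` and `10.0` rather than by whole numbers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dividing two integers gives an integer; a decimal operand gives a decimal result
row = conn.execute("SELECT 4 + 3, 4 - 3, 4 * 3, 4 / 3, 4 / 3.0").fetchone()

print(row[:4])           # (7, 1, 12, 1)
print(round(row[4], 2))  # 1.33
```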

In [27]:
people = pd.read_csv('people_copy.csv',index_col=0)

In [28]:
people_sql = '''SELECT * 
FROM people;'''
sqldf(people_sql,env=None)

Unnamed: 0,id,name,birthdate,deathdate
0,1,50 Cent,1975-07-06,
1,2,A. Michael Baldwin,1963-04-04,
2,3,A. Raven Cruz,,
3,4,A.J. Buckley,1978-02-09,
4,5,A.J. DeLucia,,
...,...,...,...,...
8392,8393,Zohra Segal,1912-04-27,2014-07-10
8393,8394,Zooey Deschanel,1980-01-17,
8394,8395,Zoran Lisinac,,
8395,8396,Zubaida Sahar,,


# Aliasing with functions
Aliasing can be a lifesaver, especially as we start to write more complex SQL queries with multiple criteria. Aliases help you keep your code clean and readable. For example, if you want to find the MAX() value of several fields without aliasing, you'll end up with a result containing several columns called max and no idea which is which. You can fix this with aliasing.
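
A minimal sketch of the fix, using the built-in `sqlite3` module: each MAX() gets its own alias, and the cursor reports those aliases as column names. The `demo_films` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical toy table with two numeric fields
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_films (budget REAL, duration REAL)")
conn.executemany(
    "INSERT INTO demo_films VALUES (?, ?)", [(100.0, 90.0), (250.0, 120.0)]
)

# Give each aggregate a distinct, descriptive name with AS
cursor = conn.execute(
    "SELECT MAX(budget) AS max_budget, MAX(duration) AS max_duration "
    "FROM demo_films"
)
column_names = [col[0] for col in cursor.description]
values = cursor.fetchone()

print(dict(zip(column_names, values)))
# {'max_budget': 250.0, 'max_duration': 120.0}
```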

Now, it's over to you to clean up the following queries.

In [29]:
f = '''SELECT * 
FROM films;'''
sqldf(f,env=None).head()

Unnamed: 0,id,title,release_year,country,duration,language,certification,gross,budget
0,1,Intolerance: Love's Struggle Throughout the Ages,1916.0,USA,123.0,,Not Rated,,385907.0
1,2,Over the Hill to the Poorhouse,1920.0,USA,110.0,,,3000000.0,100000.0
2,3,The Big Parade,1925.0,USA,151.0,,Not Rated,,245000.0
3,4,Metropolis,1927.0,Germany,145.0,German,Not Rated,26435.0,6000000.0
4,5,Pandora's Box,1929.0,Germany,110.0,German,Not Rated,9950.0,


In [31]:
# Select the title and calculate duration_hours from films
duration_hours = '''
SELECT title, (duration / 60.0) AS duration_hours
FROM films;'''
sqldf(duration_hours,env=None)

Unnamed: 0,title,duration_hours
0,Intolerance: Love's Struggle Throughout the Ages,2.050000
1,Over the Hill to the Poorhouse,1.833333
2,The Big Parade,2.516667
3,Metropolis,2.416667
4,Pandora's Box,1.833333
...,...,...
4963,Unforgotten,0.750000
4964,Wings,0.500000
4965,Wolf Creek,
4966,Wuthering Heights,2.366667


In [32]:
# Calculate the percentage of people who are no longer alive
percentage_dead = '''
SELECT COUNT(deathdate) * 100.0 / COUNT(*) AS percentage_dead
FROM people;'''
sqldf(percentage_dead,env=None)

Unnamed: 0,percentage_dead
0,9.372395
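
The percentage query above works because COUNT(deathdate) counts only non-missing values, while COUNT(*) counts every row. A minimal sketch of that difference, using the built-in `sqlite3` module and a hypothetical four-row `demo_people` table:

```python
import sqlite3

# Hypothetical toy table: one person with a deathdate, three without
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_people (name TEXT, deathdate TEXT)")
conn.executemany(
    "INSERT INTO demo_people VALUES (?, ?)",
    [("A", "2014-07-10"), ("B", None), ("C", None), ("D", None)],
)

# COUNT(column) skips NULLs, COUNT(*) does not
with_deathdate, total, percentage = conn.execute(
    "SELECT COUNT(deathdate), COUNT(*), "
    "COUNT(deathdate) * 100.0 / COUNT(*) FROM demo_people"
).fetchone()

print(with_deathdate, total, percentage)  # 1 4 25.0
```

Multiplying by `100.0` before dividing keeps the result a decimal; with `100` the division of two integers would drop the fractional part.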


In [33]:
# Find the number of decades in the films table
number_of_decades = '''
SELECT (MAX(release_year) - MIN(release_year)) / 10.0 AS number_of_decades
FROM films;'''
sqldf(number_of_decades,env=None)

Unnamed: 0,number_of_decades
0,10.0


- Amazing work mastering arithmetic, aggregate functions, and aliasing! Now you know that our films table covers films released over a span of one hundred years!

# Rounding results
You found some valuable insights in the previous exercise, but many of the results were inconveniently long. We forgot to round! We won't make you redo them all; however, you'll update the worst offender in this exercise.

In [36]:
# Round duration_hours to two decimal places
duration_hours = '''
SELECT title, ROUND(duration / 60.0, 2) AS duration_hours
FROM films;'''
sqldf(duration_hours,env=None)

Unnamed: 0,title,duration_hours
0,Intolerance: Love's Struggle Throughout the Ages,2.05
1,Over the Hill to the Poorhouse,1.83
2,The Big Parade,2.52
3,Metropolis,2.42
4,Pandora's Box,1.83
...,...,...
4963,Unforgotten,0.75
4964,Wings,0.50
4965,Wolf Creek,
4966,Wuthering Heights,2.37


- That's better! Now you can clearly see how long a movie is.