#Final Project
##Focus + Goal
When a movie succeeds at the box office, who should get the credit? Is it the memorable actors who brought fictional characters to life? Is it the talented directors who transformed their vision into valuable lessons? Or is it the writers who masterfully manipulated the audience's emotions? It can be incredibly difficult to pinpoint exactly which factor most heavily influences a movie's success. With this project, we therefore want to investigate how different pre-production and post-production factors correlate with a movie's commercial success, as measured by its total revenue.

##Central Question
How do factors such as rating, domestic lifetime gross, foreign lifetime gross, and run time affect a movie’s performance, as measured by worldwide lifetime revenue (gross), and its overall performance on the charts, as measured by its ranking?
##DataSets
For this project, we plan to use 3 primary datasets, scraped from 3 separate websites:
1. The Movie Database (TMDB) - https://www.themoviedb.org/?language=en-US
   (NOTE: `id.csv` Google Drive link: https://drive.google.com/drive/u/0/folders/1BSrN6WlFonKPqdtvEV1_6OIcTGsA_gXA)
2. Box Office Mojo - https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW
3. IMDB - https://www.imdb.com/chart/top/

In [None]:
#IMPORT LIBRARIES
import requests
from lxml import etree
import io
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from statistics import mean
import json
pd.options.display.max_columns = 30
import sqlite3 as sql
import sqlalchemy
# Let sqlite3 store NumPy integers as SQLite integers instead of BLOBs
sql.register_adapter(np.int64, lambda val: int(val))
sql.register_adapter(np.int32, lambda val: int(val))

##Web Scraping - TMDB

First, we scrape the TMDB website. To get the movie data in `JSON` format, we use the `Discover` API from TMDB: we request our own API key and form a query for the information we need to analyze. The query below asks for movies released between 2000-01-01 and 2022-01-01.

In [None]:
api_key  = "api_key=d3544fa6fe1bca97f387774d47ed2f15"
discover_api = "https://api.themoviedb.org/3/discover/movie?"
query = "&primary_release_date.gte=2000-01-01&primary_release_date.lte=2022-01-01"
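
As a possible alternative to concatenating the URL by hand, the sketch below builds the same kind of Discover request with `urllib.parse.urlencode` and reads the key from an environment variable instead of the notebook. The variable name `TMDB_API_KEY`, the placeholder key, and the helper `build_discover_url` are assumptions made for illustration; the `page` parameter is included because Discover results are paginated.

```python
import os
from urllib.parse import urlencode

DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"

def build_discover_url(api_key, page=1):
    """Return the Discover URL for one page of results in the date window used above."""
    params = {
        "api_key": api_key,
        "primary_release_date.gte": "2000-01-01",
        "primary_release_date.lte": "2022-01-01",
        "page": page,
    }
    return DISCOVER_URL + "?" + urlencode(params)

# TMDB_API_KEY is a hypothetical environment variable; "YOUR_KEY" is a placeholder.
key = os.environ.get("TMDB_API_KEY", "YOUR_KEY")
print(build_discover_url(key, page=2))
```

Keeping the key out of the notebook makes it easier to share the notebook without exposing the key.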

We then request the content of the page using the `requests` module and load the `JSON` response into a dataframe.

In [None]:
url= discover_api+api_key+query
movie_data=requests.get(url).json()
pd.DataFrame(movie_data)

Unnamed: 0,page,results,total_pages,total_results
0,1,"{'adult': False, 'backdrop_path': '/7ABsaBkO1j...",18889,377775
1,1,"{'adult': False, 'backdrop_path': '/14QbnygCuT...",18889,377775
2,1,"{'adult': False, 'backdrop_path': '/3rCuqnCQP7...",18889,377775
3,1,"{'adult': False, 'backdrop_path': '/7ysBoVZhLP...",18889,377775
4,1,"{'adult': False, 'backdrop_path': None, 'genre...",18889,377775
5,1,"{'adult': False, 'backdrop_path': None, 'genre...",18889,377775
6,1,"{'adult': False, 'backdrop_path': '/geYUecpFI2...",18889,377775
7,1,"{'adult': False, 'backdrop_path': '/qjGrUmKW78...",18889,377775
8,1,"{'adult': False, 'backdrop_path': '/3G1Q5xF40H...",18889,377775
9,1,"{'adult': False, 'backdrop_path': '/jlGmlFOcfo...",18889,377775


After this, we focus on the `results` column, which holds the movie records returned for the date range in the query.

In [None]:
movie=pd.DataFrame(movie_data["results"])
movie

Unnamed: 0,adult,backdrop_path,genre_ids,id,original_language,original_title,overview,popularity,poster_path,release_date,title,video,vote_average,vote_count
0,False,/7ABsaBkO1jA2psC8Hy4IDhkID4h.jpg,"[28, 12, 14, 878]",19995,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",699.52,/jRXYjXNq0Cs2TcJjLkki24MLp7u.jpg,2009-12-15,Avatar,False,7.5,26715
1,False,/14QbnygCuTO0vl7CAFmPf1fgZfV.jpg,"[28, 12, 878]",634649,en,Spider-Man: No Way Home,Peter Parker is unmasked and no longer able to...,416.358,/uJYYizSuA9Y3DCs0qS4qWvHfZg4.jpg,2021-12-15,Spider-Man: No Way Home,False,8.0,15998
2,False,/3rCuqnCQP7tZJ1rqqzSwxCzcW0w.jpg,"[18, 10749]",247136,ja,Ｍ家の新妻　変態洗礼,"Mikage will get married to Youiti next year, s...",359.864,/2oVfD5rUV2EElbQ11ds2Vf5nRaZ.jpg,2009-03-27,The Temptation of Kimono,False,5.4,7
3,False,/7ysBoVZhLPipvyZ8gyS9qvnPjUc.jpg,[18],795514,en,The Fallout,"In the wake of a school tragedy, Vada, Mia and...",331.519,/4ByHl9XRKR2iXbvF0ZilMRD1RcL.jpg,2021-03-17,The Fallout,False,7.5,425
4,False,,[10749],485470,ko,착한 형수2,"If you give it once, a good brother-in-law who...",327.341,/3pEs4hmeHvTAsmx09whEaPDOQpq.jpg,2017-10-08,Nice Sister-In-Law 2,False,6.0,2
5,False,,[27],888838,en,The Long Dark Trail,After two impoverished teenage brothers manage...,325.087,/ebdDGnqQXDGfiggHSazaWCLF6Lf.jpg,2021-06-19,The Long Dark Trail,False,5.4,7
6,False,/geYUecpFI2AonDLhjyK9zoVFcMv.jpg,"[16, 28, 14]",810693,ja,劇場版 呪術廻戦 0,Yuta Okkotsu is a nervous high school student ...,281.588,/3pTwMUEavTzVOh6yLN0aEwR7uSy.jpg,2021-12-24,Jujutsu Kaisen 0,False,8.3,686
7,False,/qjGrUmKW78MCFG8PTLDBp67S27p.jpg,"[16, 28, 12, 14]",635302,ja,劇場版「鬼滅の刃」無限列車編,"Tanjirō Kamado, joined with Inosuke Hashibira,...",247.391,/h8Rb9gBr48ODIwYUttZNYeMWeUU.jpg,2020-10-16,Demon Slayer -Kimetsu no Yaiba- The Movie: Mug...,False,8.3,2807
8,False,/3G1Q5xF40HkUBJXxt2DQgQzKTp5.jpg,"[16, 35, 10751, 14]",568124,en,Encanto,"The tale of an extraordinary family, the Madri...",234.609,/4j0PNHkMr5ax3IA8tjtxcmPU3QT.jpg,2021-10-13,Encanto,False,7.7,7634
9,False,/jlGmlFOcfo8n5tURmhC7YVd4Iyy.jpg,"[28, 35, 12]",436969,en,The Suicide Squad,"Supervillains Harley Quinn, Bloodsport, Peacem...",228.543,/kb4s0ML0iVZlG6wAKbbs9NAm6X.jpg,2021-07-28,The Suicide Squad,False,7.6,6783


Notice that we are only examining the content of one page rather than all of them. We therefore need an automated process that combines the results of every page into a single dataset. For this we turn to another TMDB API, the `movie` API, which gives us a more detailed version of the dataset above.

The limitation of the `movie` API is that each movie must be requested individually by its unique ID. Therefore, we create another CSV file (named `id.csv`) containing the unique IDs of 4500 movies (and movie franchises) for this part of our analysis. You can access the CSV file by following the link: https://drive.google.com/drive/u/0/folders/1BSrN6WlFonKPqdtvEV1_6OIcTGsA_gXA

In [None]:

movie_id=pd.read_csv("id.csv") #read the id dataset
basic_url = 'https://api.themoviedb.org/3/movie/{}?{}'  #access the movie API
movie_id

Unnamed: 0,id
0,862
1,8844
2,15602
3,31357
4,11862
...,...
4496,36221
4497,26215
4498,64559
4499,28123


Then, we create a for loop to access the information for each of the 4500 movies, store that information in a list, and read that list into a Pandas dataframe.

In [None]:
json_list = []
movie_id_l = movie_id['id'].tolist()
for mid in movie_id_l:
    url = basic_url.format(mid, api_key)
    r = requests.get(url)
    if r.status_code != 200:  # skip IDs that do not resolve (e.g. 404)
        continue
    json_list.append(r.json())
df = pd.DataFrame(json_list)





In [None]:
requests.get(basic_url.format(0, api_key)).status_code #test case for a nonexistent movie ID

404

For this part of our analysis, we are interested in the `title`, `id`, `revenue`, `genres`, `belongs_to_collection`, and `runtime` fields of the dataset. The cell below keeps `title`, `revenue`, and `runtime` and sorts the rows by revenue in descending order.

In [None]:
movie = df.loc[:, ["title", "revenue", "runtime"]].sort_values(by = "revenue", ascending = False)
movie

Unnamed: 0,title,revenue,runtime
1619,Titanic,2187463944,194
2487,Star Wars: Episode I - The Phantom Menace,924317558,136
474,Jurassic Park,920100000,127
754,Independence Day,817400891,145
1055,E.T. the Extra-Terrestrial,792965500,115
...,...,...,...
2752,Alvarez Kelly,0,106
2753,And the Ship Sails On,0,132
2755,Gulliver's Travels,0,76
948,Small Wonders,0,77


After sorting the dataframe, we save it as a `JSON` file named `movies.json` so that it can later be loaded into the database.

In [None]:
movie.to_json("movies.json", orient = "records") #store the data as a json file

In [None]:
with open("movies.json") as f: #open the json file
    data = json.load(f)

Next, we use `json_normalize` to flatten the `JSON` records into a Pandas DataFrame and check that the formatting is correct.

In [None]:
pd.json_normalize(data)

Unnamed: 0,title,revenue,runtime
0,Titanic,2187463944,194
1,Star Wars: Episode I - The Phantom Menace,924317558,136
2,Jurassic Park,920100000,127
3,Independence Day,817400891,145
4,E.T. the Extra-Terrestrial,792965500,115
...,...,...,...
4489,Alvarez Kelly,0,106
4490,And the Ship Sails On,0,132
4491,Gulliver's Travels,0,76
4492,Small Wonders,0,77


We also examined how many title and genre combinations there are among those 4500 movies. The result has over 10791 rows, which shows that many of the 4500 movies above belong to more than one genre.
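
The cell that produced this count is not shown, so the following is only a sketch of one way to get one row per title and genre pair. It assumes the `genres` field holds a list of `{"id", "name"}` dictionaries per movie, as in the TMDB movie response; the toy titles below are made up.

```python
import pandas as pd

# Toy rows shaped like the `genres` field of a TMDB movie response
# (a list of {"id", "name"} dictionaries per movie); the titles are made up.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [
        [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}],
        [{"id": 18, "name": "Drama"}],
    ],
})

# One row per (movie, genre) pair, then keep only the genre name.
title_genre = toy.explode("genres", ignore_index=True)
title_genre["genre"] = title_genre["genres"].apply(lambda g: g["name"])
title_genre = title_genre[["title", "genre"]]
print(title_genre)
```

A movie with several genres contributes several rows, which is why the row count can exceed the number of movies.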

## Web Scraping - IMDB

Now, using Beautiful Soup, we will scrape IMDB's Top 250 Movies page for information about movie ratings. First, let us acquire the page. To do so, we submit a GET request, passing the url of the page to get a Response object.

In [None]:
url = 'https://www.imdb.com/chart/top/'
response = requests.get(url)
assert response.status_code == 200


Then, we use `response.content` to get the raw HTML of the page and pass it to the Beautiful Soup constructor, which returns the Beautiful Soup object `soup`, a representation of the HTML source as a nested data structure.

In [None]:
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 250 Movies - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/chart/top/" rel="canonical"/>
<meta content="http://www.imdb.com/chart/top/" property="og:url">
<script>
    if (typeof uet == 'function') {
      uet("bb", "Lo

After having successfully imported the HTML to our notebook, we can begin parsing it. First, we create a data frame with 3 columns as specified in the code cell. This dataframe will store the information we scraped from the webpage.





In [None]:
IMDB_DF = pd.DataFrame(columns=['Title','Release Year','Rating'])

To scrape the title, release year, and rating, we use the `.find()` method. The information we are interested in is nested in the `tbody` tag. Hence, to obtain this section, we use `.find()` to search for the first occurrence of `tbody`, then we use `.find_all()` to find all `tr` tags within `tbody`.

In [None]:
movies = soup.find('tbody', class_="lister-list").find_all('tr')

Next, we use a for loop to iterate through each of the `tr` tags within `tbody`. Because information about each movie is contained in a separate `tr` tag, this process can be understood as going through a list of movies to read the information of each one.

Each category is scraped as follows:
1. **Movie Title**: For each row, `.find('td',class_="titleColumn").find('a')` accesses the name of the movie, which is in the `a` tag, a child of the `td` tag. We call `.text` to get the text in the `a` tag.
2. **Year**: For each row, `.find('td',class_="titleColumn").find('span')` accesses the year the movie was released, which is in the `span` tag, a child of the `td` tag. We call `.text` to get the text, and `.replace()` to remove the parentheses in the output. For example, (2015) becomes 2015.
3. **Rating**: For each row, `.find('td',class_='ratingColumn imdbRating')` accesses the rating of the movie in the `td` tag whose class attribute is `"ratingColumn imdbRating"`. Then, we use `.text` to get the text and `.strip()` to remove whitespace and newline characters.

Then, we add the information to the `IMDB_DF` dataframe. Additionally, we clean the data by dropping NaN values and converting the Rating and Release Year columns from strings to numbers.

In [None]:
rows_imdb = []
# The loop variable is `row` so that the TMDB dataframe `movie` is not overwritten
for row in movies:
  title_cell = row.find('td',class_="titleColumn")
  title = title_cell.find('a').text
  year = title_cell.find('span').text.replace("(","").replace(")","")
  rating = row.find('td',class_='ratingColumn imdbRating').text.strip()
  rows_imdb.append({'Title':title,'Release Year':year,'Rating':rating})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
IMDB_DF = pd.DataFrame(rows_imdb, columns=['Title','Release Year','Rating'])
IMDB_DF = IMDB_DF.dropna()
IMDB_DF['Rating'] = pd.to_numeric(IMDB_DF['Rating'], errors='coerce')
IMDB_DF['Release Year'] = pd.to_numeric(IMDB_DF['Release Year'], errors='coerce')
IMDB_DF.head(20)

Unnamed: 0,Title,Release Year,Rating
0,The Shawshank Redemption,1994,9.2
1,The Godfather,1972,9.2
2,The Dark Knight,2008,9.0
3,The Godfather Part II,1974,9.0
4,12 Angry Men,1957,9.0
5,Schindler's List,1993,8.9
6,The Lord of the Rings: The Return of the King,2003,8.9
7,Pulp Fiction,1994,8.8
8,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
9,"The Good, the Bad and the Ugly",1966,8.8


##Web Scraping - Box Office Mojo

Lastly, we will use XPath to scrape Box Office Mojo's website. First, we acquire the page by submitting a GET request, passing the URL of the page to get a Response object. We then check that the response has status code 200.

In [None]:
url = 'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW'
response = requests.get(url)
assert response.status_code == 200

Next, we use the htmlparser to get the parse tree `tree1`. Then, we get its root `root1` by calling the method `getroot()`. Subsequently, we parse the tree to get the table `node_table` by specifying the path `"body/div//table"` as the parameter to the xpath method.

In [None]:
htmlparser = etree.HTMLParser()
tree1 = etree.parse(io.BytesIO(response.content), htmlparser)
root1 = tree1.getroot()

node_table = root1.xpath("body/div//table")

We now wish to create a List of Lists representation of the table, in order to build a dataframe. To do so, we must first create a list of every item in the table. This can be achieved by iterating through `node_table[0]`, appending each item if it is not None to `rows`.

In [None]:
#Create LoL
rows = []
for item in node_table[0].iter():
  if item.text is not None:
    rows.append(item.text.strip().replace("$","").replace("%",""))
rows

['Rank',
 'Title',
 'Worldwide Lifetime Gross',
 'Domestic Lifetime Gross',
 'Domestic ',
 'Foreign Lifetime Gross',
 'Foreign ',
 'Year',
 '1',
 'Avatar',
 '2,922,917,914',
 '785,221,649',
 '26.9',
 '2,137,696,265',
 '73.1',
 '2009',
 '2',
 'Avengers: Endgame',
 '2,797,501,328',
 '858,373,000',
 '30.7',
 '1,939,128,328',
 '69.3',
 '2019',
 '3',
 'Titanic',
 '2,201,647,264',
 '659,363,944',
 '30',
 '1,542,283,320',
 '70',
 '1997',
 '4',
 'Star Wars: Episode VII - The Force Awakens',
 '2,069,521,700',
 '936,662,225',
 '45.3',
 '1,132,859,475',
 '54.7',
 '2015',
 '5',
 'Avengers: Infinity War',
 '2,048,359,754',
 '678,815,482',
 '33.1',
 '1,369,544,272',
 '66.9',
 '2018',
 '6',
 'Spider-Man: No Way Home',
 '1,916,306,995',
 '814,115,070',
 '42.5',
 '1,102,191,925',
 '57.5',
 '2021',
 '7',
 'Jurassic World',
 '1,671,537,444',
 '653,406,625',
 '39.1',
 '1,018,130,819',
 '60.9',
 '2015',
 '8',
 'The Lion King',
 '1,663,250,487',
 '543,638,043',
 '32.7',
 '1,119,612,444',
 '67.3',
 '2019',
 

Using a list comprehension, we create `LoL` from `rows`, taking 8 items at a time (one per table column).

In [None]:
length = 8  # number of columns in the table
LoL = [
    rows[i:i+length] for i in range(0,len(rows),length)
]
LoL
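
To make the slicing step easier to follow, here is a toy version with 3 columns instead of 8. The names `flat`, `width`, and `chunks` are illustrative only.

```python
# A flat list holding a 3-column header followed by two data rows.
flat = ["Rank", "Title", "Year", "1", "Avatar", "2009", "2", "Titanic", "1997"]
width = 3

# Step through the list `width` items at a time; each slice becomes one row.
chunks = [flat[i:i + width] for i in range(0, len(flat), width)]
print(chunks)
```

If the flat list's length is not a multiple of the width, the last row comes out shorter, so this approach relies on every table cell contributing exactly one item.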

Finally, we build the `BoxOffice` dataframe and set its index to the Rank of the movie. We also convert the numeric columns from string to numeric format.

In [None]:
BoxOffice = pd.DataFrame(LoL)
BoxOffice.columns = BoxOffice.iloc[0]
BoxOffice = BoxOffice[1:]
BoxOffice = BoxOffice.set_index(BoxOffice.columns[0])

#Convert columns to numeric format
BoxOffice = BoxOffice.dropna()
BoxOffice['Worldwide Lifetime Gross'] = pd.to_numeric(BoxOffice['Worldwide Lifetime Gross'].str.replace(',',''), errors='coerce')
BoxOffice['Domestic Lifetime Gross'] = pd.to_numeric(BoxOffice['Domestic Lifetime Gross'].str.replace(',',''), errors='coerce')
BoxOffice['Domestic '] = pd.to_numeric(BoxOffice['Domestic '], errors='coerce')
BoxOffice['Foreign Lifetime Gross'] = pd.to_numeric(BoxOffice['Foreign Lifetime Gross'].str.replace(',',''), errors='coerce')
BoxOffice['Foreign '] = pd.to_numeric(BoxOffice['Foreign '], errors='coerce')
BoxOffice['Year'] = pd.to_numeric(BoxOffice['Year'], errors='coerce')#convert to integers
BoxOffice.head(20)

Unnamed: 0_level_0,Title,Worldwide Lifetime Gross,Domestic Lifetime Gross,Domestic,Foreign Lifetime Gross,Foreign,Year
Rank,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Avatar,2922917914,785221649.0,26.9,2137696265,73.1,2009
2,Avengers: Endgame,2797501328,858373000.0,30.7,1939128328,69.3,2019
3,Titanic,2201647264,659363944.0,30.0,1542283320,70.0,1997
4,Star Wars: Episode VII - The Force Awakens,2069521700,936662225.0,45.3,1132859475,54.7,2015
5,Avengers: Infinity War,2048359754,678815482.0,33.1,1369544272,66.9,2018
6,Spider-Man: No Way Home,1916306995,814115070.0,42.5,1102191925,57.5,2021
7,Jurassic World,1671537444,653406625.0,39.1,1018130819,60.9,2015
8,The Lion King,1663250487,543638043.0,32.7,1119612444,67.3,2019
9,The Avengers,1518815515,623357910.0,41.0,895457605,59.0,2012
10,Furious 7,1515341399,353007020.0,23.3,1162334379,76.7,2015


##How we plan to use the data to answer the central questions

We intend to answer our central question by finding the correlation between revenue and each indicator addressed in the central question, utilizing everything we have found in all three datasets.



### 1. Rating

We need to find the correlation between revenue and rating.
- Use a scatter plot to gauge the strength of the correlation between the variables.
- Moreover, we can fit a linear regression model between these variables to validate the scatter plot.
- Lastly, if the scatter plot suggests a relationship that “might be monotonic, might be linear”, we can compute the Spearman rank correlation coefficient. Spearman is our best bet in that case because it measures any monotonic relationship, whereas Pearson measures only linear association.
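
The difference between the two coefficients can be sketched on made-up data that is perfectly monotonic but not linear. The numbers below are not project results, and Spearman is computed here as the Pearson correlation of the ranks, which is its definition.

```python
import pandas as pd

# Made-up, perfectly monotonic but non-linear data (not project results).
toy = pd.DataFrame({"rating": [6.0, 6.5, 7.0, 7.5, 8.0, 8.5]})
toy["revenue"] = toy["rating"] ** 6

# Pearson measures linear association between the raw values.
pearson = toy["rating"].corr(toy["revenue"])
# Spearman is the Pearson correlation computed on the ranks of the values.
spearman = toy["rating"].rank().corr(toy["revenue"].rank())
print(round(pearson, 3), round(spearman, 3))
```

On data like this, Spearman reaches 1 while Pearson stays below 1, which is the situation the last bullet describes.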


### 2. Genre
- Besides determining the correlation between genre and revenue, we also need to determine which genre produces the highest-grossing films using a bar chart.

### 3. Distributor
- Determine which film studio has the highest revenue/budget ratio, using a histogram and a box plot.
- Investigate whether a film studio has a major influence on a film's box office success.


### 4. Director
- First, we need to investigate the correlation between a director’s reputation and movie revenue.
- Next, we also need to find out which director is the most successful in terms of box office success using a bar chart.

### 5. Run time
- Investigate the correlation between run time and movie revenue.
- Build linear regression model, graph scatter plot and determine Spearman Rank Correlation coefficient (similar to the processes above).
- Answer the sub question: Is there a good length that a movie needs to reach in order to maximize revenue or not?
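
As a rough sketch of how the run time steps might look, the example below fits a least-squares line with `numpy.polyfit` and compares median revenue across run time buckets with `pandas.cut`. All values and bucket edges are made up for illustration and are not taken from our datasets.

```python
import numpy as np
import pandas as pd

# Made-up run times (minutes) and revenues (USD millions), for illustration only.
toy = pd.DataFrame({
    "runtime": [85, 95, 105, 120, 135, 150, 165, 180],
    "revenue": [40, 60, 90, 150, 210, 260, 240, 200],
})

# Least-squares line: revenue ~ slope * runtime + intercept.
slope, intercept = np.polyfit(toy["runtime"], toy["revenue"], deg=1)

# Median revenue per run time bucket, to see whether some length range stands out.
buckets = pd.cut(toy["runtime"], bins=[0, 100, 130, 160, 200])
by_bucket = toy.groupby(buckets, observed=True)["revenue"].median()
print(slope, intercept)
print(by_bucket)
```

The bucket view helps with the sub-question: if revenue peaks in a middle bucket, a single straight line would hide that.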



#DataBase Design

Now that we have successfully scraped our datasets, we will move on to the next phase of the project, which is to store the data in a database. Because we are only working with 3 datasets, our database design is relatively straightforward. We will name our database `movie.db`, and within this database, we will have 3 tables: `MOVIEDB` (TMDB data), `IMDB`, and `BOXOFFICE` (Box Office Mojo data).

##Creating Tables in the Database

First, we write a function whose job is to create tables for the 3 datasets we just scraped.

In [None]:
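def sql_create(db, cqry):
  # Open the database, run the CREATE TABLE statement, then save and close
  conn = sql.connect(db)
  curr = conn.cursor()
  curr.execute(cqry)
  conn.commit()
  conn.close()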


Next, we specify the name of the table and the table columns for each dataset.

In [None]:
db = "movie.db"
cqry = 'CREATE TABLE IMDB (title VARCHAR(100) NOT NULL, release INT, rating FLOAT DEFAULT 0.0)'
sql_create(db, cqry)


In [None]:

jqry='CREATE TABLE MOVIEDB (title VARCHAR(100) NOT NULL, revenue INT ,  runtime INT) '

sql_create(db,jqry)

In [None]:

jqry='CREATE TABLE BOXOFFICE(Rank INT, Title VARCHAR(100) NOT NULL, WGross BIGINT, DGross BIGINT, DPercent FLOAT DEFAULT 0.0, FGross BIGINT, FPercent FLOAT DEFAULT 0.0, Year INT) '

sql_create(db,jqry)

##Populating Tables

First, we convert each of the three dataframes to a list of tuples as follows:

In [None]:
LOT_IMDB=[tuple(x) for x in IMDB_DF.to_records(index=False)]
LOT_MOVIEDB=[tuple(y) for y in movie.to_records(index=False)]
LOT_BOXOFFICE=[tuple(z) for z in BoxOffice.to_records()]
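
For reference, the conversion can be tried on a toy dataframe. The sketch below shows the `to_records` approach used above next to `itertuples(index=False, name=None)`, a possible alternative that yields plain tuples directly. The toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Movie A", "Movie B"], "revenue": [100, 250]})

# Same idea as the cell above: one tuple per row via to_records.
lot_records = [tuple(r) for r in toy.to_records(index=False)]

# Alternative: itertuples with name=None yields plain tuples directly.
lot_tuples = list(toy.itertuples(index=False, name=None))
print(lot_tuples)
```

With `to_records`, integer values stay NumPy integers, which is why the adapters registered in the import cell matter when inserting.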


Next, we create a function to insert each LOT into the corresponding table inside our `movie.db` database.

In [None]:

def sql_insert(db, iqry, data):
    conn = sql.connect(db)
    curr = conn.cursor()
    curr.executemany(iqry, data)
    conn.commit()
    conn.close()


Then we write queries to insert values of each column into the corresponding tables.

In [None]:
iqry_IMDB='INSERT INTO IMDB VALUES(?,?,?)'
iqry_MOVIEDB='INSERT INTO MOVIEDB VALUES(?,?,?)'
iqry_BOXOFFICE='INSERT INTO BOXOFFICE VALUES(?,?,?,?,?,?,?,?)'
sql_insert(db,iqry_IMDB,LOT_IMDB)
sql_insert(db,iqry_MOVIEDB,LOT_MOVIEDB)
sql_insert(db,iqry_BOXOFFICE,LOT_BOXOFFICE)



After that, we write a function that selects rows from a table in our database, which we use to check that the data was inserted correctly.

In [None]:


def sql_select(db, sql_q):
    conn = sql.connect(db)
    cur = conn.cursor()
    result = cur.execute(sql_q).fetchall()
    conn.close()
    return result

sql_q1 = 'SELECT * FROM BOXOFFICE'
sql_select(db, sql_q1)

[(1, 'Avatar', 2922917914, 785221649, 26.9, 2137696265, 73.1, 2009),
 (2, 'Avengers: Endgame', 2797501328, 858373000, 30.7, 1939128328, 69.3, 2019),
 (3, 'Titanic', 2201647264, 659363944, 30.0, 1542283320, 70.0, 1997),
 (4,
  'Star Wars: Episode VII - The Force Awakens',
  2069521700,
  936662225,
  45.3,
  1132859475,
  54.7,
  2015),
 (5,
  'Avengers: Infinity War',
  2048359754,
  678815482,
  33.1,
  1369544272,
  66.9,
  2018),
 (6,
  'Spider-Man: No Way Home',
  1916306995,
  814115070,
  42.5,
  1102191925,
  57.5,
  2021),
 (7, 'Jurassic World', 1671537444, 653406625, 39.1, 1018130819, 60.9, 2015),
 (8, 'The Lion King', 1663250487, 543638043, 32.7, 1119612444, 67.3, 2019),
 (9, 'The Avengers', 1518815515, 623357910, 41.0, 895457605, 59.0, 2012),
 (10, 'Furious 7', 1515341399, 353007020, 23.3, 1162334379, 76.7, 2015),
 (11, 'Top Gun: Maverick', 1487994195, 717994195, 48.2, 770000000, 51.8, 2022),
 (12, 'Frozen II', 1450026933, 477373578, 32.9, 972653355, 67.1, 2019),
 (13,
  'Av

##Challenges when storing data in DB
1. The data that we scraped from the web was stored in dataframes. Therefore, in order to insert it into a SQL database, we needed to transform each dataframe into a list of tuples.
2. When we initially loaded the data into the database, some integer columns were stored as BLOB (Binary Large Object) values. The cause was their NumPy `int64` type, which `sqlite3` does not convert to a SQLite integer by default. This caused us a lot of headache, but after digging into documentation on the web, we came across a helpful post on Stack Overflow (https://stackoverflow.com/questions/49456158/integer-in-python-pandas-becomes-blob-binary-in-sqlite). By referencing this post, we overcame this challenge by adding two `sql.register_adapter` lines (in the import cell) so that SQLite receives `int64` and `int32` values as plain Python integers.
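
The effect of the adapters can be checked in isolation. The sketch below uses a throwaway in-memory database and a hypothetical `demo` table; the gross value is Avatar's worldwide figure from the Box Office Mojo table above.

```python
import sqlite3
import numpy as np

# Tell sqlite3 to convert NumPy integers to plain Python ints before binding.
sqlite3.register_adapter(np.int64, int)
sqlite3.register_adapter(np.int32, int)

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE demo (title TEXT, gross BIGINT)")
conn.execute("INSERT INTO demo VALUES (?, ?)", ("Avatar", np.int64(2922917914)))
conn.commit()

# typeof() reports the storage class SQLite actually used for the value.
stored = conn.execute("SELECT gross, typeof(gross) FROM demo").fetchone()
conn.close()
print(stored)
```

With the adapters registered, `typeof` reports `integer` for the inserted value.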
