# Studio: Working with Databases in Python

For today's studio, we will be using the [TV Shows dataset](https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney) from Kaggle. We have already downloaded the CSV for you.

You will use the watchlist you created to answer these questions:

1. **Which streaming services contain the shows you want to watch next?**
2. **Which streaming service is the best value based on the shows you want to watch?**

As you complete the different tasks in the studio, you may choose between using Pandas or SQL. 

**Remember**: we learned in our prep work that one is oftentimes more efficient at certain tasks than the other, so choose wisely!

## My Watchlist

If you would like, please use this space to make note of your watchlist by editing the text cell. You will need 10 shows overall.

1. Netflix Show 1
2. Netflix Show 2
3. Netflix Show 3
4. Netflix Show 4
5. Netflix Show 5
6. Netflix Show 6
7. Netflix Show 7
8. Netflix Show 8
9. Netflix Show 9
10. Netflix Show 10

## Database Setup

Import the necessary libraries and create a dataframe from the provided CSV. 

Print the info out for the dataframe. 

After that, you may drop the column called `Unnamed: 0` and rename any columns with spaces or unusual characters in the names such as `"Disney+"`. 

Print out the info for the dataframe again to ensure your changes were made.

In [37]:
# Code here
import pandas as pd
import sqlite3 as sql
import numpy as np

# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
df = pd.read_csv("tv_shows.csv")
df = df.drop("Unnamed: 0", axis=1)
# Replace spaces and "+" so the names can be used unquoted in SQL
df = df.rename(columns={
    "Rotten Tomatoes": "Rotten_Tomatoes",
    "Prime Video": "Prime_Video",
    "Disney+": "Disney_Plus",
})

# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5368 entries, 0 to 5367
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               5368 non-null   int64 
 1   Title            5368 non-null   object
 2   Year             5368 non-null   int64 
 3   Age              3241 non-null   object
 4   IMDb             4406 non-null   object
 5   Rotten_Tomatoes  5368 non-null   object
 6   Netflix          5368 non-null   int64 
 7   Hulu             5368 non-null   int64 
 8   Prime_Video      5368 non-null   int64 
 9   Disney_Plus      5368 non-null   int64 
 10  Type             5368 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 461.4+ KB


With your dataframe at the ready, create a new database called `tv.db`. 

Add a new table to your database called `shows` using the data in the dataframe. 

In [42]:
# Code here
from pandas import read_sql_query

tv_db = sql.connect("tv.db")

# to_sql takes the table name (a string) as its first argument and creates
# the table itself, so no separate CREATE TABLE statement is needed
df.to_sql("shows", tv_db, if_exists="replace", index=False)
tv_db.commit()

# The connection stays open so that later cells can query tv.db through tv_db

With your new table and database set up, print out the top 20 records in the `shows` table.

In [None]:
# Code Here
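
As a rough illustration of this step, the sketch below selects the first 20 rows of a `shows` table with `LIMIT`. It is only a sketch under stated assumptions: it uses an in-memory database and made-up placeholder rows rather than `tv.db` and the real CSV, and the three-column table layout is hypothetical.

```python
import sqlite3

# Assumption: an in-memory database stands in for "tv.db"
conn = sqlite3.connect(":memory:")

# Hypothetical, trimmed-down version of the shows table with placeholder rows
conn.execute("CREATE TABLE shows (ID INTEGER, Title TEXT, Netflix INTEGER)")
conn.executemany(
    "INSERT INTO shows (ID, Title, Netflix) VALUES (?, ?, ?)",
    [(i, f"Placeholder Show {i}", i % 2) for i in range(1, 31)],
)

# LIMIT caps the number of rows returned by the query
top_20 = conn.execute("SELECT * FROM shows LIMIT 20").fetchall()
for row in top_20:
    print(row)

conn.close()
```

With the real database, the same `SELECT ... LIMIT 20` query could also be passed to `read_sql_query` to get the rows back as a dataframe.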

Now, create a new table called `watchlist` that has three fields:
1. id -> data type of `INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT`
2. title -> data type of `TEXT`
3. importance_rank -> data type of `INTEGER`

For the `importance_rank` field, rank each of your watchlist shows based on how much you want to see them, `10` being the most important and `1` being the least important.

Then, insert each of the items from your watchlist into the new `watchlist` table, using the `executemany` method from our exercises.

Finally, select all the records from the `watchlist` table and print them out to the console.

In [None]:
# Code here
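
One possible shape for this step is sketched below. The table definition follows the three fields listed above; the in-memory connection and the three placeholder titles are assumptions made for the example, and a real watchlist would have 10 entries.

```python
import sqlite3

# Assumption: an in-memory database stands in for "tv.db"
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE IF NOT EXISTS watchlist (
        id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        importance_rank INTEGER
    )
""")

# Hypothetical entries: (title, importance_rank), 10 = most important
my_watchlist = [
    ("Placeholder Show A", 10),
    ("Placeholder Show B", 9),
    ("Placeholder Show C", 8),
]

# executemany runs the INSERT once per tuple; id is filled in automatically
conn.executemany(
    "INSERT INTO watchlist (title, importance_rank) VALUES (?, ?)",
    my_watchlist,
)
conn.commit()

watchlist_rows = conn.execute("SELECT * FROM watchlist ORDER BY id").fetchall()
for row in watchlist_rows:
    print(row)

conn.close()
```

The `?` placeholders let SQLite handle quoting, which matters for titles that contain apostrophes.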

## Working with the Data

Using Pandas or SQL, find the answer to these 2 questions:
1. How many of the total shows (full csv list) are on each streaming service?
2. What percentage of these total shows is available on each streaming service?

**Hint**:

Use the pandas `query` method to filter the data, and then Python's built-in `len` function to find its length. [Relevant Link](https://www.geeksforgeeks.org/ways-to-filter-pandas-dataframe-by-column-values/)

In [None]:
# Code here
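
The hint above could be applied roughly as follows. This is a sketch on a tiny made-up dataframe, not the real CSV, and the column names (`Prime_Video`, `Disney_Plus`) are assumptions; if your column names contain spaces, `query` needs them wrapped in backticks, for example ``df.query("`Prime Video` == 1")``.

```python
import pandas as pd

# Hypothetical sample rows; 1 means the show is available on that service
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Netflix": [1, 1, 0, 1],
    "Hulu": [0, 1, 0, 0],
    "Prime_Video": [1, 0, 1, 0],
    "Disney_Plus": [0, 0, 0, 1],
})

services = ["Netflix", "Hulu", "Prime_Video", "Disney_Plus"]
total_shows = len(df)

# Filter with query, then count the remaining rows with len
counts = {name: len(df.query(f"{name} == 1")) for name in services}

# Share of all shows that each service carries, as a percentage
percentages = {
    name: round(100 * count / total_shows, 1) for name, count in counts.items()
}

for name in services:
    print(f"{name}: {counts[name]} shows ({percentages[name]}%)")
```

Because a show can be on more than one service, the percentages are not expected to add up to 100.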


Now join your `watchlist` data to the `shows` data using pandas or SQL. Verify that you joined the data correctly.

Using this joined dataset, write code that calculates:
1. The number of watchlist shows each streaming service has
2. The percentage of your overall watchlist each streaming service has


In [None]:
# Code here
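
A minimal SQL-based sketch of the join is shown below, assuming an in-memory database, made-up rows, and underscore-style column names (all hypothetical). It uses a `LEFT JOIN` so that watchlist titles missing from `shows` still appear, which gives a simple way to verify the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: stands in for "tv.db"
conn.executescript("""
    CREATE TABLE shows (Title TEXT, Netflix INTEGER, Hulu INTEGER,
                        Prime_Video INTEGER, Disney_Plus INTEGER);
    CREATE TABLE watchlist (
        id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        importance_rank INTEGER
    );
    INSERT INTO shows VALUES
        ('Placeholder A', 1, 0, 1, 0),
        ('Placeholder B', 1, 1, 0, 0),
        ('Placeholder C', 0, 0, 0, 1);
    INSERT INTO watchlist (title, importance_rank) VALUES
        ('Placeholder A', 10),
        ('Placeholder B', 9),
        ('Not In Dataset', 8);
""")

join_sql = "FROM watchlist AS w LEFT JOIN shows AS s ON s.Title = w.title"

# Verification: watchlist titles with no match have NULL service columns
unmatched = [
    row[0]
    for row in conn.execute(f"SELECT w.title {join_sql} WHERE s.Title IS NULL")
]

# Count watchlist shows per service; COALESCE turns an all-NULL sum into 0
services = ["Netflix", "Hulu", "Prime_Video", "Disney_Plus"]
sums = ", ".join(f"COALESCE(SUM(s.{name}), 0)" for name in services)
row = conn.execute(f"SELECT COUNT(*), {sums} {join_sql}").fetchone()

watchlist_size = row[0]
counts = dict(zip(services, row[1:]))
percentages = {
    name: round(100 * n / watchlist_size, 1) for name, n in counts.items()
}

print("Unmatched titles:", unmatched)
print(counts)
print(percentages)
conn.close()
```

One thing to watch for with real data: if a title appears more than once in `shows`, the join returns one row per match, so the row count can exceed the size of the watchlist.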

## Results

Now that you have done your analysis, make note of the answers to the following questions by editing the text cell:

1. Was every show on your watchlist in the Kaggle dataset? Do you have any ideas as to why a show might not have been present?

- *your_answer*

2. Did you include a show or shows in your watchlist that is exclusive to one of the platforms? How might that have impacted your analysis?

- *your_answer*

3. Which streaming service(s) offered the most shows on your watchlist? Which streaming service(s) offered the least?

- *your_answer*

4. Based on the shows you want to watch and the results of your analysis, is there a streaming service you think would be a good fit for you?

- *your_answer*

# Bonus Mission

We didn't end up using that `importance_rank` field, did we?

Well, that was intentional! 

Your bonus mission is to come up with an analysis that uses that field to determine which platform you should subscribe to, based on both the `importance_rank` of your watchlist shows and the number of watchlist shows available on each service.

In [None]:
# Code Here
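
There is no single right answer here. One possible scoring rule, shown purely as an illustration, is to add up the `importance_rank` of every watchlist show a service carries and use the number of available shows as a tie-breaker. The data, table layout, and column names below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: stands in for "tv.db"
conn.executescript("""
    CREATE TABLE shows (Title TEXT, Netflix INTEGER, Hulu INTEGER,
                        Prime_Video INTEGER, Disney_Plus INTEGER);
    CREATE TABLE watchlist (
        id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        importance_rank INTEGER
    );
    INSERT INTO shows VALUES
        ('Placeholder A', 1, 0, 1, 0),
        ('Placeholder B', 1, 1, 0, 0),
        ('Placeholder C', 0, 0, 0, 1);
    INSERT INTO watchlist (title, importance_rank) VALUES
        ('Placeholder A', 10),
        ('Placeholder B', 9),
        ('Placeholder C', 3);
""")

services = ["Netflix", "Hulu", "Prime_Video", "Disney_Plus"]
scores = {}
for name in services:
    # Sum of ranks and number of watchlist shows available on this service
    rank_total, available = conn.execute(f"""
        SELECT COALESCE(SUM(w.importance_rank), 0), COUNT(*)
        FROM watchlist AS w
        JOIN shows AS s ON s.Title = w.title
        WHERE s.{name} = 1
    """).fetchone()
    scores[name] = (rank_total, available)

# Tuples compare element by element: rank total first, then show count
best_service = max(scores, key=lambda name: scores[name])

for name, (rank_total, available) in scores.items():
    print(f"{name}: rank total {rank_total}, {available} show(s)")
print("Suggested service:", best_service)
conn.close()
```

Other weightings are just as defensible, for example dividing the rank total by each service's monthly price if you want to factor in value.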