# Introduction 
* Technological advancements of the 21st century have made streaming television shows and movies far more popular than purchasing DVDs or paying to see a movie in theaters. At the same time, streaming services continue to raise their subscription prices, leaving many people wondering _which streaming service is right for me?_ This project aims to answer that question by comparing the catalogs of Netflix, Hulu, Disney+, and Amazon Prime Video against the top 1000 movies and TV shows on IMDb as of June 2021. Someone who loves a wide variety of movies and TV shows would benefit from the service that carries the most top-rated titles overall, while someone who only watches documentaries would want the service with the largest selection of top documentaries. Likewise, a fan of reality TV would want the service with the most highly rated reality TV shows. Then, of course, there is the devil's advocate who wants to know which service has the fewest top-rated titles. This project aims to cater to each of these tastes.

In [21]:
# Import necessary libraries: 
import pandas as pd 
import numpy as np 

# Data Exploration
* Load the datasets into pandas DataFrames
* Explore the dataset using various pandas functions 
* Identify any missing values, outliers, or other issues that need to be addressed

In [22]:
# Load the datasets into pandas DataFrames
amazon_df = pd.read_csv("amazon_prime_titles.csv")
disney_df = pd.read_csv("disney_plus_titles.csv")
hulu_df = pd.read_csv("hulu_titles.csv")
netflix_df = pd.read_csv("netflix_titles.csv")
imdb_tv_df = pd.read_csv("imdb_1000_tvshows.csv")
imdb_movies_df = pd.read_csv("imdb_1000_movies.csv")

# Explore the datasets
print(amazon_df.columns)
print(disney_df.columns)
print(hulu_df.columns)
print(netflix_df.columns)
# the datasets all have the same columns, which will make it easy to remove columns and perform joins with the imdb datasets

# Check for missing values
# (a notebook cell displays only its last expression, so uncomment one line at a time)

# amazon_df.isna().sum()
# disney_df.isna().sum()
# hulu_df.isna().sum()
# netflix_df.isna().sum()
# imdb_movies_df.isna().sum()
# imdb_tv_df.isna().sum()


Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')


ranking            0
series name\r      0
Year               0
certificate      170
runtime           25
genre              0
rating           450
DETAILS            8
ACTOR 1            2
ACTOR 2            2
ACTOR 3            6
ACTOR 4            7
VOTES              0
dtype: int64
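
The output above shows that the IMDb TV column names mix upper and lower case and that one of them carries a stray carriage return (`series name\r`). A minimal sketch of one way to normalize such names before any joins follows. The helper name `normalize_columns` and the stand-in frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are trimmed, lower-case, and underscore-separated."""
    out = df.copy()
    out.columns = (
        out.columns
        .str.strip()                            # removes surrounding whitespace, including "\r"
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)   # "actor 1" -> "actor_1"
    )
    return out


# Stand-in frame (assumption) that mimics a few of the column names printed above.
demo = pd.DataFrame(columns=["ranking", "series name\r", "Year", "ACTOR 1", "VOTES"])
print(list(normalize_columns(demo).columns))
```

Applied to `imdb_tv_df` and `imdb_movies_df`, the same helper would give both frames predictable column names, which should make later renaming and merging less error-prone.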

### Columns with Missing Data in Each Dataset
#### Number of missing values per column: streaming services
| Column       | Amazon | Disney+ | Hulu | Netflix |
|--------------|--------|---------|------|---------|
| show_id      | 0      | 0       | 0    | 0       |
| type         | 0      | 0       | 0    | 0       |
| title        | 0      | 0       | 0    | 0       |
| director     | 2082   | 473     | 3070 | 2634    |
| cast         | 1233   | 190     | 3073 | 825     |
| country      | 8996   | 219     | 1453 | 831     |
| date_added   | 9513   | 3       | 28   | 10      |
| release_year | 0      | 0       | 0    | 0       |
| rating       | 337    | 3       | 520  | 4       |
| duration     | 0      | 0       | 479  | 3       |
| listed_in    | 0      | 0       | 0    | 0       |
| description  | 0      | 0       | 4    | 0       |

#### Number of missing values per column: IMDb movies and TV shows
| Column            | Movies | TV shows |
|-------------------|--------|----------|
| ranking of movie  | 0      | 0        |
| movie/series name | 0      | 0        |
| Year              | 0      | 0        |
| certificate       | 5      | 170      |
| runtime           | 0      | 25       |
| genre             | 0      | 0        |
| RATING            | 0      | 450      |
| metascore         | 163    | NA       |
| DETAIL(S)         | 0      | 8        |
| DIRECTOR          | 0      | NA       |
| ACTOR 1           | 0      | 2        |
| ACTOR 2           | 0      | 2        |
| ACTOR 3           | 0      | 6        |
| ACTOR 4           | 0      | 7        |
| votes             | 0      | 0        |
| GROSS COLLECTION  | 180    | NA       |
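
Tables like the two above can be built in one step instead of being copied by hand from separate `isna().sum()` calls. The following is a minimal sketch under that assumption. The function name `missing_value_table` and the tiny stand-in frames are hypothetical; in the notebook the dictionary would hold `amazon_df`, `disney_df`, and so on.

```python
import pandas as pd


def missing_value_table(frames: dict) -> pd.DataFrame:
    """One row per column name, one column per dataset, values are missing-value counts."""
    counts = {name: df.isna().sum() for name, df in frames.items()}
    # Aligning on the index means a column absent from a dataset appears as NaN.
    return pd.concat(counts, axis=1)


# Stand-in frames (assumption), deliberately with different columns.
demo_frames = {
    "A": pd.DataFrame({"title": ["x", "y"], "director": [None, "d"]}),
    "B": pd.DataFrame({"title": ["z"], "rating": [None]}),
}
summary = missing_value_table(demo_frames)
print(summary)
```

Because the counts are aligned by column name, a dataset that lacks a column gets `NaN` in that cell, which plays the same role as the `NA` entries in the IMDb table.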


# Data cleaning
* Address any missing values, outliers, and similar issues
* Remove any unnecessary or irrelevant columns from the DataFrames
* Perform any desired data aggregation 
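
The cleaning steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual cleaning code: the `KEEP` list, the `clean_catalog` helper, and the stand-in rows are all hypothetical, and the choice of which columns matter depends on the questions being asked.

```python
import pandas as pd

# Hypothetical choice of columns to keep from the shared streaming-service schema.
KEEP = ["type", "title", "release_year", "listed_in"]


def clean_catalog(df: pd.DataFrame, service: str) -> pd.DataFrame:
    """Keep the relevant columns, drop unusable or duplicate titles, and tag the service."""
    out = df[KEEP].copy()
    out["title"] = out["title"].str.strip()
    out = out.dropna(subset=["title"]).drop_duplicates(subset=["title", "type"])
    out["service"] = service
    return out.reset_index(drop=True)


# Stand-in rows (assumption) that follow the column layout printed earlier.
demo = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Alpha ", "Alpha", None],
    "director": [None, None, "Someone"],
    "release_year": [2001, 2001, 2015],
    "listed_in": ["Drama", "Drama", "Reality"],
})

# With the real data, one cleaned frame per service would be concatenated here.
catalog = pd.concat([clean_catalog(demo, "DemoService")], ignore_index=True)

# Simple aggregation: number of titles per service and type.
print(catalog.groupby(["service", "type"]).size())
```

Adding a `service` column before concatenating keeps all four catalogs in one frame, so later grouping and joining only has to be written once.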

# Data Visualization
* Create visualizations (histograms, scatterplots, bar charts, etc.) to explore the relationships between variables in the dataset
* Use seaborn to create more complex visualizations (heatmaps, pairplots, etc.) to uncover patterns and trends in the data
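
As a starting point for the visualizations described above, here is a minimal seaborn bar chart sketch. The counts are placeholder numbers for illustration only, not results from the datasets, and the column names `service` and `n_top_titles` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder counts, NOT real results; replace with counts computed from the data.
counts = pd.DataFrame({
    "service": ["Amazon", "Disney+", "Hulu", "Netflix"],
    "n_top_titles": [1, 2, 3, 4],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.barplot(data=counts, x="service", y="n_top_titles", ax=ax)
ax.set_xlabel("Streaming service")
ax.set_ylabel("Titles in IMDb top 1000")
ax.set_title("Placeholder data for illustration")
fig.tight_layout()
```

In a notebook the `matplotlib.use("Agg")` line can be dropped so that the figure renders inline.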

# Data Analysis
* Use pandas and numpy functions to analyze the data and answer specific questions about the dataset
* Draw conclusions and make inferences based on the analysis (perform a hypothesis test!)
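
One possible shape for this analysis is sketched below: flag which catalog titles appear in a top list, count them per service, and run a chi-square test of independence on the resulting table. Everything here is an assumption for illustration. The helper names, the `title` and `service` column names, and the toy titles are hypothetical, and matching on title text alone can mis-match remakes or different works that share a name.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def normalize_title(titles: pd.Series) -> pd.Series:
    """Trim and lower-case titles so trivial differences do not block a match."""
    return titles.str.strip().str.lower()


def top_title_counts(catalog: pd.DataFrame, top: pd.DataFrame) -> pd.DataFrame:
    """Per service, count catalog titles that do and do not appear in the top list."""
    top_keys = set(normalize_title(top["title"]))
    flagged = catalog.assign(in_top=normalize_title(catalog["title"]).isin(top_keys))
    grouped = flagged.groupby("service")["in_top"].agg(in_top="sum", total="size")
    grouped["not_in_top"] = grouped["total"] - grouped["in_top"]
    return grouped[["in_top", "not_in_top"]]


# Toy data (assumption), far too small for a meaningful test.
catalog = pd.DataFrame({
    "service": ["A"] * 4 + ["B"] * 4,
    "title": ["Alpha", "Beta", "Gamma", "Delta", "alpha ", "Epsilon", "Zeta", "Eta"],
})
top = pd.DataFrame({"title": ["Alpha", "Beta", "Gamma"]})

counts = top_title_counts(catalog, top)
print(counts)

# Null hypothesis: the share of top-list titles is the same for every service.
chi2, p_value, dof, expected = chi2_contingency(counts.to_numpy())
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
```

With the real catalogs the table would have four rows, one per service. Filtering `catalog` by genre first (for example, on `listed_in`) would give the per-genre comparison described in the introduction.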

# Data Storytelling
* Use the insights gained from the data analysis to tell a compelling story about the dataset
* Incorporate visualizations and data points to support the narrative
* Make sure the story is clear, concise, and relevant to the audience

# Conclusion
* Summarize the key findings of the analysis and the story
* Reflect on any limitations or potential sources of bias in the analysis
* Discuss potential next steps for further analysis or investigation