# `Explain your motivation to pick this data and the business case to develop` :

I am working with this dataset **(Netflix Movies and TV Shows Dataset)** because understanding media consumption patterns, preferences, and trends among users is key in the media industry. By analyzing the data in the given columns, I can gain insights into the popularity of different types of media content, which genres are most in demand, which actors, directors, and creators are the most popular, and which streaming services or sources are most commonly used (in our case, Netflix).

This data can be useful for media companies, streaming services, and content creators who want to understand their audience and tailor their offerings accordingly. For example, a streaming service can use this data to decide which TV shows or movies to add to its platform, based on the popularity of different genres and actors. It can also use this data to personalize recommendations for users based on their preferences and viewing history.

# `Explain the data sets, what do they describe, and what information you intend to use from these data set`:

   **The dataset contains the following columns:**

   * **url:** The URL where the media content can be accessed or downloaded.
   * **name:** The name of the media content.
   * **contentType:** The type of media content, such as TV show, movie, or documentary.
   * **description:** A brief description of the media content.
   * **contentRating:** The rating assigned to the media content, such as PG or R.
   * **genre:** The genre or category that the media content belongs to, such as drama, comedy, or action.
   * **poster:** The URL of a poster or image associated with the media content.
   * **formattedDuration:** The duration of the media content, formatted in a standard way (e.g. "1h 30m").
   * **releasedDate:** The date that the media content was originally released.
   * **actors:** The names of the actors who appear in the media content.
   * **director:** The name of the director of the media content.
   * **creator:** The name of the creator or writer of the media content.
   * **audio:** The language or languages available for the audio track of the media content.
   * **subtitle:** The language or languages available for subtitles of the media content.
   * **availableForOffline:** A boolean value indicating whether the media content can be downloaded and viewed offline.
   * **numberOfSeasons:** The number of seasons in the case of TV shows.
   * **seasonStartDate:** The start date of the season.
   * **source:** The source of the media content, such as Netflix or Amazon Prime Video.
   * **sourceLink:** The URL of the source where the media content was obtained from.
   * **uniqId:** A unique identifier for the media content.
   * **scrapedAt:** The date and time when the data was scraped or obtained.

# load data :
In this cell we import pandas and plotly.express **(Python libraries)**. pandas lets us read our CSV **(comma-separated values)** file using **`pd.read_csv`**, and the dataset is then stored in the variable called **`show`**. plotly.express lets us present our results in visual form for easier interpretation.

In [1]:
import pandas as pd
import plotly.express as px
show = pd.read_csv('https://query.data.world/s/3iniczknql3cry2qjkabz4q7p2k6wv')

# Data_cleaning 1:

Here we dropped the rows whose `contentRating` is **`All`**, then removed the **`+`** sign from the remaining ratings and converted the resulting values to integers using a lambda function. Any value that is not a string (for example, a missing rating) is set to `0` so that the column can be used numerically.

In [2]:
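# Drop the rows rated 'All'; .copy() can avoid pandas' SettingWithCopyWarning on the assignment below
show = show[show['contentRating'] != 'All'].copy()
# Strip the '+' sign and convert to int; non-string (missing) values become 0
show['contentRating'] = show['contentRating'].apply(lambda x: int(x.replace('+', '')) if isinstance(x, str) and x != 'All' else 0)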
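
To see what this cleaning step does, here is a minimal sketch on a small, made-up Series. The sample values (`'18+'`, `'13+'`, and so on) are assumptions for illustration and may not cover every format in the real column.

```python
import pandas as pd

# Hypothetical sample ratings; dtype=object keeps None as a plain missing value.
ratings = pd.Series(['18+', '13+', 'All', None, '7+'], dtype=object)

# Drop the rows rated 'All' (the missing value is kept by this comparison).
kept = ratings[ratings != 'All']

# Strip the '+' sign and convert to int; anything that is not a string becomes 0.
cleaned = kept.apply(lambda x: int(x.replace('+', '')) if isinstance(x, str) else 0)

print(cleaned.tolist())  # [18, 13, 0, 7]
```

The row rated `All` disappears, while the missing rating survives as `0`.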

# Data_cleaning 2:

Here we extracted the year from the **`releasedDate`** column, overwriting the original values. Using the `.str` accessor, `slice(0, 4)` picks the first four characters of each date string, which in our case is the year we need, and `astype(int)` converts it to an integer.

In [3]:
show['releasedDate'] = show['releasedDate'].str.slice(0, 4).astype(int)
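
As a small illustration of the slicing step, the sketch below uses made-up date strings and assumes they start with a four-digit year (the exact format in the real column is not shown here). It also shows one possible alternative, parsing with `pd.to_datetime`, which gives the same years for these samples.

```python
import pandas as pd

# Hypothetical date strings; the format of the real column is an assumption.
dates = pd.Series(['2015-12-31', '2019-07-04', '2022-05-03'])

# Same approach as the notebook: take the first four characters, then cast to int.
years_by_slice = dates.str.slice(0, 4).astype(int)

# Alternative: parse the strings as dates and read the year component.
years_by_parse = pd.to_datetime(dates).dt.year

print(years_by_slice.tolist())  # [2015, 2019, 2022]
print(years_by_parse.tolist())  # [2015, 2019, 2022]
```

Slicing is simple but depends on the year being the first four characters. Parsing is more tolerant of format changes but fails on values that are not valid dates.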

# Data_cleaning 3: 

Here we created a smaller DataFrame called **`subset`** that keeps the modified **`contentRating`** and **`releasedDate`** columns from `Data_cleaning 1 and 2` respectively, together with the `genre` and `formattedDuration` columns.

In [4]:
subset = show[['contentRating', 'genre', 'releasedDate', 'formattedDuration']]

# Data_cleaning 4:

Here we counted the number of occurrences of each genre using the **`value_counts`** function and stored the result in a variable called **`genre_counts`**. These results were then filtered to keep only the genres that occur more than twice (**`genre_counts_filtered`**). We also counted the values of **`releasedDate`** and stored them in a variable called **`releasedDate_counts`**. Both steps give us cleaner inputs for the visualisations that follow.

In [5]:
genre_counts = subset['genre'].value_counts()
genre_counts_filtered = genre_counts[genre_counts > 2]
releasedDate_counts = subset['releasedDate'].value_counts()
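
The counting and filtering can be sketched on a tiny, made-up genre column (the genre names below are only examples):

```python
import pandas as pd

# Hypothetical genre column.
genres = pd.Series(['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Action'])

# Count how often each genre occurs (sorted from most to least frequent).
counts = genres.value_counts()

# Keep only the genres that occur more than twice.
filtered = counts[counts > 2]

print(filtered.to_dict())  # {'Drama': 3}
```

Here `Comedy` (2 occurrences) and `Action` (1 occurrence) are filtered out because the condition is strictly greater than two.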

# Graph 1
# `Plotting A Bar Chart`
* This bar chart shows the most watched genre, measured here by how many titles each genre has in the dataset.

Here we added a couple of customizations to make the graph self-explanatory; these are listed below:
* We plotted the values from `genre_counts_filtered` (obtained from our previous cleaning). The main purpose of this graph is to give a visual representation of the most common genres over the period covered by the data, which makes it easy to see which genres dominate.
* We changed the value of `hover_data` so that the hover text is more informative than the default.
* We added titles to both the `x` and `y` axes so that anyone can understand the graph.


In [6]:
fig_1 = px.bar(subset,
               x=genre_counts_filtered.index, 
               y=genre_counts_filtered.values, 
               color=genre_counts_filtered.values,
               hover_data={'Genres': genre_counts_filtered.index,
                           'Frequency': genre_counts_filtered.values})

fig_1.update_layout(
    yaxis_tickformat=',',
    title_text='Most watched Genre from Dec 31, 2015 to May 3, 2022',
    xaxis_title='Genres',
    yaxis_title='Frequency',
    legend=dict(title='Year'))

fig_1.update_traces(hovertemplate='Genre: %{customdata[0]}<br>Frequency: %{y:,}')

fig_1.show()
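
As an optional variation, the counts can first be turned into a small DataFrame with named columns, so that a plotting call could refer to the columns by name (for example `px.bar(genre_df, x='Genres', y='Frequency')`) instead of passing the index and values arrays separately. This is only a sketch: the counts are made up and `genre_df` is a name introduced here, not part of the original notebook.

```python
import pandas as pd

# Made-up counts standing in for genre_counts_filtered.
genre_counts_filtered = pd.Series({'Drama': 5, 'Comedy': 3})

# Name the index 'Genres' and move it into a column next to 'Frequency'.
genre_df = genre_counts_filtered.rename_axis('Genres').reset_index(name='Frequency')

print(genre_df)
```

With a frame like this, the hover labels and axis titles can come directly from the column names.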


# Graph 2
# `Plotting A Hollow Pie Chart`
* This pie chart shows us the years with the most releases.

Here we also added a couple of customizations to make the graph self-explanatory; these are listed below:
* We plotted the values from `releasedDate_counts` (obtained from our previous cleaning above). The main purpose of this graph is to give a visual representation of the year with the highest production, making it easy to see the rate at which the industry has been growing over the years.
* We added different colors this time to make the data more presentable.
* The title was placed at the center.

In [7]:
colors = ['#ff5733','#32456E','#1ACC38', '#ffd633']
fig_2 = px.pie(subset, values=releasedDate_counts.values,
             names=releasedDate_counts.index,
             hole=0.2,
            color_discrete_sequence=colors)
fig_2.update_traces(textfont=dict(size=14),hovertemplate='Year = %{label}<br>Released = %{value}<br>Percentage = %{percent}')

fig_2.update_layout(
    title_text='Years with the Most Releases', title_x=.5, title_y=.9,
    font=dict(size=18, color= '#32456E'),
    legend=dict(title_text='Years'),
    margin=dict(l=0, r=0, t=100, b=0),
    height=500)
fig_2.show()
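
The percentages that the pie chart displays can also be checked numerically. A minimal sketch on made-up release years, using `value_counts(normalize=True)` to get each year's share of all releases:

```python
import pandas as pd

# Hypothetical release years, one entry per title.
years = pd.Series([2019, 2020, 2020, 2021, 2021, 2021, 2021, 2020])

# Absolute number of releases per year.
counts = years.value_counts()

# Share of all releases per year (fractions that sum to 1).
shares = years.value_counts(normalize=True)

print(counts.idxmax())   # 2021
print(shares.loc[2021])  # 0.5
```

In this toy data, 2021 has four of the eight titles, so its slice would cover half of the pie.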

# Data_cleaning 5:
This line of code performs a **`groupby`** operation on `subset` to group rows by the values in the `releasedDate` column.

Once the rows have been grouped by the values in the **`releasedDate`** column, the **`max`** method is called on the `contentRating` column of the resulting **`groupby`** object to determine its maximum value for each group. We then stored our results in a new pandas Series called **`max_ratings`** that contains the maximum `contentRating` value for each unique `releasedDate` value in the `subset` DataFrame.

In [8]:
max_ratings = subset.groupby('releasedDate')['contentRating'].max()
max_ratings

releasedDate
2015    18
2016    18
2017    18
2018    18
2019    18
2020    18
2021    18
2022    18
Name: contentRating, dtype: int64
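
The same `groupby` and `max` pattern on a tiny, made-up DataFrame makes the behaviour easier to follow. The values below are illustrative only; `toy` and `max_by_year` are names introduced for this sketch.

```python
import pandas as pd

# Hypothetical rows: two titles per release year.
toy = pd.DataFrame({
    'releasedDate': [2020, 2020, 2021, 2021],
    'contentRating': [13, 18, 7, 16],
})

# For each release year, keep the highest content rating.
max_by_year = toy.groupby('releasedDate')['contentRating'].max()

print(max_by_year.to_dict())  # {2020: 18, 2021: 16}
```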

# Data_cleaning 6:
* The first line of code creates a new DataFrame, mapping **`released_year`** to **`max_ratings.index`** and **`max_rating`** to **`max_ratings.values`** (both obtained from the cell above).
* This new DataFrame (**`data`**) is used in the second line of code: for each year in the `released_year` column of `data`, the corresponding count of movies and TV shows released in that year is looked up in `releasedDate_counts` and added to the `num_shows` column of the `data` DataFrame.

In [9]:
data = pd.DataFrame({'released_year': max_ratings.index, 'max_rating': max_ratings.values})
data['num_shows'] = [releasedDate_counts[date] for date in data['released_year']]
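
The per-year lookup can also be written with `Series.map`, which looks up each year in the index of the counts Series. The sketch below uses made-up numbers in place of the real `max_ratings` and `releasedDate_counts`; it is one possible alternative to the list comprehension, not the notebook's own code.

```python
import pandas as pd

# Made-up stand-ins for the Series built in the earlier cells.
max_ratings = pd.Series({2020: 18, 2021: 16}, name='contentRating')
releasedDate_counts = pd.Series({2021: 40, 2020: 25})

data = pd.DataFrame({'released_year': max_ratings.index,
                     'max_rating': max_ratings.values})

# Look up each year in releasedDate_counts; the order of the counts does not matter.
data['num_shows'] = data['released_year'].map(releasedDate_counts)

print(data['num_shows'].tolist())  # [25, 40]
```

One difference in behaviour: `map` gives a missing value for a year that is absent from the counts, whereas the list comprehension would raise a `KeyError`.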

# Graph 3
# `Plotting A Scatter Plot`
* This scatter plot shows us, for each release year, the highest content rating and the number of titles released, as a way of finding the year with the most production of `R`-rated movies.

Here we also added a couple of customizations to make the plot self-explanatory; these are listed below:
* We plotted the values from the `released_year`, `max_rating`, and `num_shows` columns of `data` (all obtained from our recent cleaning above). The main purpose of this plot is to give a visual representation of the year with the highest production of `R`-rated shows, which makes it easy to see which age group the media products have targeted over the years. Note that `num_shows` counts every title released in a year, not only the titles with the maximum rating.
* We added a color gradient this time to make the data more presentable.
* The title was also placed at the center.

In [10]:
color_sequence = ['#ff5733','#32456E','#1ACC38', '#ffd633']
fig_3 = px.scatter(data, x=data['released_year'], y=data['max_rating'], size=data['num_shows'], size_max=50, hover_data=['released_year', 'max_rating'],
                  color=data['num_shows'], color_continuous_scale=color_sequence,)
fig_3.update_layout(
    title='Year with Highest Rating vs Year of Release',
    title_x=0.5, title_y=.93,
    xaxis_title='Year of Release',
    yaxis_title='Maximum Content Rating',
    legend=dict(title_text='Years')
)
fig_3.show()

# `Citation `:

Crawl Feeds. "Netflix Movies and TV Shows Dataset." Data World, last modified February 8, 2023. Accessed on February 16, 2023. Available from **https://data.world/crawlfeeds/netflix-movies-and-tv-shows-dataset**.

The main data source can be obtained from the URL provided, where additional information about it can also be found. The specific files used from this dataset were not specified in the information provided, but the dataset as a whole consists of 20 fields of data and covers the time period from Dec 31, 2015 to May 3, 2022.