# Week 2 - Preprocessing, part 2

# 1. Lesson: None

# 2. Weekly graph question

The Storytelling With Data book mentions planning on a "Who, What, and How" for your data story.  Write down a possible Who, What, and How for your data, using the ideas in the book.

In [18]:
'''
Who
    Audience: The intended audience for this data story includes Netflix content managers, product managers, and data analysts who are looking to understand trends in Netflix show releases, their scores, and the number of seasons. This could also be relevant for marketing and business analysts who want insights to help in decision-making about future content strategy.
    Context: These professionals aim to optimize the content selection process, understand what makes a successful show (e.g., score and number of seasons), and ensure that the content library is well-balanced in terms of genre, production, and release year.
What
    Objective: The goal is to examine the patterns in Netflix shows across several key attributes: release year, score, number of seasons, genre, and production company. Specifically, we aim to answer questions such as:
    What is the average score of Netflix shows across years?
    Do certain genres or production companies tend to have higher ratings?
    How does the number of seasons relate to show scores?
    Key Insights: Key insights may include the identification of top genres with higher ratings, trends in show releases over time, and whether shows with more seasons tend to score better or worse.
How
    Visualizations:
        A line chart to visualize the trend of show ratings over the years.
        Bar charts comparing scores across different genres or production companies.
        A scatter plot to explore the relationship between the number of seasons and the show score.
    Narrative: The story would start by introducing the dataset and its focus on Netflix show performance, followed by a breakdown of trends (e.g., how scores have evolved or which genres are most highly rated). We would highlight key findings (e.g., if shows with more seasons tend to have higher scores or if certain genres dominate in terms of production quality).
    Design: Keep the visualizations simple with clear labels and consistent color schemes. Use charts that are easy to interpret at a glance to maintain clarity and ensure the audience can quickly absorb the key takeaways.
'''



# 3. Homework - work with your own data

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

This week, you will do the same types of exercises as last week, but you should use your chosen datasets that someone in your class found last semester. (They likely will not be the particular datasets that you found yourself.)

### Here are some types of analysis you can do. Use Google, documentation, and ChatGPT to help you:

- Summarize the datasets using info() and describe()

- Are there any duplicate rows?

- Are there any duplicate values in a given column (when this would be inappropriate)?

- What are the mean, median, and mode of each column?

- Are there any missing or null values?

    - Do you want to fill in the missing value with a mean value?  A value of your choice?  Remove that row?

- Identify any other inconsistent data (e.g. someone seems to be taking an action before they are born.)

- Encode any categorical variables (e.g. with one-hot encoding.)
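Several of the checks above can be sketched on a small example. The snippet below is only an illustration under stated assumptions: the DataFrame, its column names (`birth_year`, `signup_year`, `score`, `genre`), and its values are all made up, and filling with the mean is just one of the options listed above.

```python
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "birth_year": [1990, 1985, 2001, 1978],
    "signup_year": [2010, 1980, 2020, 2015],  # Bob signs up before being born
    "score": [7.0, None, 9.0, 8.0],           # one missing value
    "genre": ["drama", "comedy", "drama", "scifi"],
})

# Option 1 for missing values: fill with the column mean.
# (Alternatives: df["score"].fillna(0) or df.dropna(subset=["score"]).)
df["score"] = df["score"].fillna(df["score"].mean())

# Inconsistent data: an action that happens before the person is born.
inconsistent = df[df["signup_year"] < df["birth_year"]]

# One-hot encode the categorical column; "genre" is replaced by genre_* columns.
encoded = pd.get_dummies(df, columns=["genre"], prefix="genre")

print(inconsistent)
print(encoded.columns.tolist())
```

Whether to fill, drop, or keep a suspicious row depends on the dataset, so it is worth writing down the reason for each choice next to the code.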

### Conclusions:

- Are the data usable?  If not, find some new data!

- Do you need to modify or correct the data in some way?

- Is there any class imbalance (categories that have many more items than other categories)?
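One rough way to look for class imbalance is to count how often each category appears. This is a minimal sketch on made-up labels; the `genre` name and the counts are assumptions, not values from any course dataset.

```python
import pandas as pd

# Hypothetical labels: 8 dramas, 1 comedy, 1 sci-fi.
labels = pd.Series(["drama"] * 8 + ["comedy"] + ["scifi"], name="genre")

counts = labels.value_counts()                # absolute count per category
shares = labels.value_counts(normalize=True)  # fraction of rows per category

# Ratio between the largest and smallest category; values far above 1 suggest imbalance.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("largest / smallest:", imbalance_ratio)
```

There is no universal cutoff for "too imbalanced"; the ratio is only a quick signal that a category may need more data or special handling.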

# 4. Storytelling With Data graph

Just like last week: choose any graph in the Introduction of Storytelling With Data. Use matplotlib to reproduce it in a rough way. I don't expect you to spend an enormous amount of time on this; I understand that you likely will not have time to re-create every feature of the graph. However, if you're excited about learning to use matplotlib, this is a good way to do that. You don't have to duplicate the exact values on the graph; just the same rough shape will be enough.  If you don't feel comfortable using matplotlib yet, do the best you can and write down what you tried or what Google searches you did to find the answers.
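As a possible starting point, here is a rough matplotlib sketch of a decluttered bar chart. It does not reproduce any specific figure from the book: the categories, values, and title are invented, and the styling choices (muted color for the baseline, direct labels, hidden spines) are only one way to approach that kind of look.

```python
import matplotlib.pyplot as plt

# Made-up values that only give a rough shape; they are not taken from the book.
categories = ["Before", "After"]
values = [40, 65]

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(categories, values, color=["lightgray", "steelblue"])

# Label each bar directly instead of relying on a y-axis.
for bar, value in zip(bars, values):
    ax.text(bar.get_x() + bar.get_width() / 2, value + 1, f"{value}%", ha="center")

# Remove visual clutter: top/right borders and y-axis ticks.
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_yticks([])
ax.set_title("Hypothetical survey result", loc="left")
```

In a notebook the figure renders at the end of the cell; in a plain script, `plt.show()` or `fig.savefig(...)` would be needed.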

In [14]:
data = pd.read_csv('Best Show by Year Netflix.csv')

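# info() prints its summary directly and returns None, so there is nothing to store
data.info()
# Summary statistics are computed here but not displayed in this cell
describe = data.describe()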

# Check for duplicate rows
duplicates = data.duplicated().sum()
print("\nDuplicate rows:", duplicates)

# Check for duplicate values in each column
duplicate_values = {col: data[col].duplicated().sum() for col in data.columns}
print("\nDuplicate cols:",duplicate_values)

# Check for missing values
missing_values = data.isnull().sum()
print("\nMissing Values:\n",missing_values)

print("\nmean median and mode")
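# Calculate the mean, median, and mode of each numeric column
mean_values = data.mean(numeric_only=True)
median_values = data.median(numeric_only=True)
# mode() can return several rows when values tie; .iloc[0] keeps only the first
# (smallest) one, so a column with all-unique values just reports its minimum
mode_values = data.mode(numeric_only=True).iloc[0]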

# Display all the relevant information
mean_values, median_values, mode_values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   TITLE              31 non-null     object 
 1   RELEASE_YEAR       31 non-null     int64  
 2   SCORE              31 non-null     float64
 3   NUMBER_OF_SEASONS  31 non-null     int64  
 4   MAIN_GENRE         31 non-null     object 
 5   MAIN_PRODUCTION    31 non-null     object 
dtypes: float64(1), int64(2), object(3)
memory usage: 1.6+ KB

Duplicate rows: 0

Duplicate cols: {'TITLE': np.int64(0), 'RELEASE_YEAR': np.int64(0), 'SCORE': np.int64(16), 'NUMBER_OF_SEASONS': np.int64(18), 'MAIN_GENRE': np.int64(25), 'MAIN_PRODUCTION': np.int64(26)}

Missing Values:
 TITLE                0
RELEASE_YEAR         0
SCORE                0
NUMBER_OF_SEASONS    0
MAIN_GENRE           0
MAIN_PRODUCTION      0
dtype: int64

mean median and mode


(RELEASE_YEAR         2005.645161
 SCORE                   8.606452
 NUMBER_OF_SEASONS       5.322581
 dtype: float64,
 RELEASE_YEAR         2007.0
 SCORE                   8.8
 NUMBER_OF_SEASONS       5.0
 dtype: float64,
 RELEASE_YEAR         1969.0
 SCORE                   8.8
 NUMBER_OF_SEASONS       1.0
 Name: 0, dtype: float64)

In [19]:
'''
Conclusion
    There are no missing values and no duplicate rows.
    TITLE and RELEASE_YEAR contain no repeated values. SCORE, NUMBER_OF_SEASONS, MAIN_GENRE, and MAIN_PRODUCTION do (16, 18, 25, and 26 repeats respectively), which only needs attention if those values were supposed to be unique.
    The data looks usable overall, with no missing-value handling needed.
'''