<a href="https://colab.research.google.com/github/EstevahnAguilera/Data-Science-Projects/blob/main/Working_with_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Python: Project

## Project Description
In this project, you will work with data from the entertainment industry. You will study a dataset with records on movies and shows. The research will focus on the “Golden Age” of television, which began in 1999 with the release of The Sopranos and is still ongoing.

The aim of this project is to investigate how the number of votes a title receives from IMDb users impacts its ratings. The assumption is that highly-rated shows (we will focus on TV shows, ignoring movies) released during the “Golden Age” of television also have the most votes.

This project is similar to the tasks you will get in your job as a data professional. Many business decisions start out as assumptions; your contribution as a data expert is to answer the question “Did the assumption formulated before the study turn out to be true?”

## Description of the data:
The dataset movies_and_shows.csv contains information about various movies and shows, including:

- name: The name of the actor or actress.
- Character: The character they played.
- r0le: The role type (e.g., ACTOR).
- TITLE: The title of the movie or show.
- Type: Whether it's a MOVIE or SHOW.
- release Year: The year it was released.
- genres: A list of genres the movie or show belongs to.
- imdb sc0re: The IMDb score of the movie or show.
- imdb v0tes: The number of votes on IMDb.

# Getting Started
We will begin by setting up our environment.

## Importing Libraries
We need to import the necessary libraries.

In [1]:
# Importing pandas
import pandas as pd
# Importing the Colab helper used to upload a local file into the session
from google.colab import files
uploaded = files.upload() # Movies and Shows

Saving movies_and_shows.csv to movies_and_shows.csv


Next, we will load the dataset into a pandas DataFrame to begin analysis.

In [2]:
df = pd.read_csv('movies_and_shows.csv')

In [3]:
# Displaying the first few rows of the DataFrame
display(df.head())

Unnamed: 0,name,Character,r0le,TITLE,Type,release Year,genres,imdb sc0re,imdb v0tes
0,Robert De Niro,Travis Bickle,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
1,Jodie Foster,Iris Steensma,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
2,Albert Brooks,Tom,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
3,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0
4,Cybill Shepherd,Betsy,ACTOR,Taxi Driver,MOVIE,1976,"['drama', 'crime']",8.2,808582.0


Understanding the structure and content of your data is important before performing any analysis.

We will need to look at the DataFrame's information.

In [4]:
# Getting information about the DataFrame and the column names
df.info()

print(df.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85579 entries, 0 to 85578
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0      name       85579 non-null  object 
 1   Character     85579 non-null  object 
 2   r0le          85579 non-null  object 
 3   TITLE         85578 non-null  object 
 4     Type        85579 non-null  object 
 5   release Year  85579 non-null  int64  
 6   genres        85579 non-null  object 
 7   imdb sc0re    80970 non-null  float64
 8   imdb v0tes    80853 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 5.9+ MB
Index(['   name', 'Character', 'r0le', 'TITLE', '  Type', 'release Year',
       'genres', 'imdb sc0re', 'imdb v0tes'],
      dtype='object')
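The `Non-Null Count` column in the output above already shows that `TITLE`, `imdb sc0re`, and `imdb v0tes` contain missing values. A minimal sketch of counting them directly with `isna().sum()` follows; the `toy` frame and its values are made up for illustration and stand in for the real dataset.

```python
import pandas as pd

# Hypothetical toy frame (not the real data), using the raw column names
toy = pd.DataFrame({
    "TITLE": ["Taxi Driver", None, "Lokillo"],
    "imdb sc0re": [8.2, None, 3.8],
    "imdb v0tes": [808582.0, None, None],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real DataFrame the same call would be `df.isna().sum()`.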


# Task 1: Data Cleaning
Let's clean the data to fix issues with the column names.

We need to rename the columns to correct any errors and make them consistent.

In [5]:
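# Rename columns to make them consistent and correct errors.
# Caveats, visible in the printed output below:
# - 'role ', 'title ' and 'type ' are written with a trailing space, and
#   later cells look the columns up by those exact names.
# - The key 'Type' does not match the original '  Type' (two leading
#   spaces), so that column keeps its old name.
df = df.rename(
    columns = {
        '   name': 'name',
        'Character':'character',
        'r0le':'role ',
        'TITLE':'title ',
        'Type': 'type ',
        'release Year':'release_year',
        'imdb sc0re':'imdb_score',
        'imdb v0tes':'imdb_votes'
    }
)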

# Printing the updated column names to confirm the changes
print(df.columns)

Index(['name', 'character', 'role ', 'title ', '  Type', 'release_year',
       'genres', 'imdb_score', 'imdb_votes'],
      dtype='object')
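Renaming by hand depends on typing every old name exactly, including invisible leading spaces. As an alternative, here is a hedged sketch that normalizes all names with string methods on the column index. The `toy` frame is hypothetical and only reuses the raw column names printed earlier; replacing `0` with `o` is an assumption that happens to suit these particular names. Adopting this in the notebook would also mean updating later lookups such as `'title '` and `'role '`.

```python
import pandas as pd

# Hypothetical empty frame that only carries the raw column names
toy = pd.DataFrame(columns=[
    "   name", "Character", "r0le", "TITLE", "  Type",
    "release Year", "genres", "imdb sc0re", "imdb v0tes",
])

toy.columns = (
    toy.columns
    .str.strip()                         # drop leading/trailing spaces
    .str.lower()                         # consistent lowercase
    .str.replace(" ", "_", regex=False)  # snake_case separators
    .str.replace("0", "o", regex=False)  # assumption: every 0 here is a typo for o
)

print(list(toy.columns))
```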


# Task 2: Correcting a Misspelled Name in the Data
While analyzing the dataset, you notice that some names are misspelled or contain special characters due to encoding issues. Accurate data is essential for reporting and recommendations, so let's correct one of these entries.

In [6]:
# Looking at the unique names in the data
print(df['name'].unique())

['Robert De Niro' 'Jodie Foster' 'Albert Brooks' ... 'In??s Prieto'
 'Isabel Gaona' 'Julian Gaviria']


We can see that "In??s Prieto" contains garbled characters.
- Using .loc[], retrieve the row where name is "In??s Prieto" and print the row to verify that you have the correct one.
- Using .loc[], update the name column for this row to "Ines Prieto".
- Verify the row again to ensure that the name has been corrected.

In [7]:
# Locate the row(s) with the incorrect name
print(df.loc[df['name'] == "In??s Prieto"])

# Correcting the name
df.loc[df['name'] == "In??s Prieto", 'name'] = "Ines Prieto"

               name character  role    title        Type  release_year  \
77798  In??s Prieto     Fanny  ACTOR  Lokillo      MOVIE          2021   
85576  In??s Prieto     Fanny  ACTOR  Lokillo  the movie          2021   

           genres  imdb_score  imdb_votes  
77798  ['comedy']         3.8        68.0  
85576  ['comedy']         3.8        68.0  


In [8]:
# Verify the correction
print(df.loc[df['name'] == "Ines Prieto"])

              name character  role    title        Type  release_year  \
77798  Ines Prieto     Fanny  ACTOR  Lokillo      MOVIE          2021   
85576  Ines Prieto     Fanny  ACTOR  Lokillo  the movie          2021   

           genres  imdb_score  imdb_votes  
77798  ['comedy']         3.8        68.0  
85576  ['comedy']         3.8        68.0  
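Only one garbled name is fixed here, but the same `??` pattern may occur in other names. A small sketch of how such entries could be listed is shown below; the `toy` frame is hypothetical, and `regex=False` matters because `?` is a special character in regular expressions.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real 'name' column
toy = pd.DataFrame({"name": ["Robert De Niro", "In??s Prieto", "Isabel Gaona"]})

# Literal substring search for the encoding debris
mask = toy["name"].str.contains("??", regex=False)
suspect_names = toy.loc[mask, "name"].unique().tolist()
print(suspect_names)
```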


# Task 3: Finding All Movies and Shows Featuring Ines Prieto
Now that we've corrected the spelling of "Ines Prieto" in the dataset, let's find all the TV shows and movies she has acted in. This type of filtering is helpful for generating actor-specific profiles or building a list of their works.

- Using a filtering condition, select rows where the name column is "Ines Prieto".
- From each matching row, retrieve only the title, release_year, imdb_score, and genres columns for a clear, concise output.

In [9]:
# Filtering rows where the actor's name is "Ines Prieto"
df.loc[df['name'] == "Ines Prieto"]

Unnamed: 0,name,character,role,title,Type,release_year,genres,imdb_score,imdb_votes
77798,Ines Prieto,Fanny,ACTOR,Lokillo,MOVIE,2021,['comedy'],3.8,68.0
85576,Ines Prieto,Fanny,ACTOR,Lokillo,the movie,2021,['comedy'],3.8,68.0


In [10]:
# Displaying the results
df.loc[df['name'] == "Ines Prieto", ['title ', 'release_year', 'imdb_score', 'genres']]

Unnamed: 0,title,release_year,imdb_score,genres
77798,Lokillo,2021,3.8,['comedy']
85576,Lokillo,2021,3.8,['comedy']
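The two rows above are identical in every selected column, so a profile built from them would list the same title twice. One way to collapse such repeats, sketched on a hypothetical `toy` frame that mirrors the output above, is `drop_duplicates()`:

```python
import pandas as pd

# Hypothetical toy frame mirroring the two rows shown above
toy = pd.DataFrame(
    {
        "name": ["Ines Prieto", "Ines Prieto"],
        "title ": ["Lokillo", "Lokillo"],
        "release_year": [2021, 2021],
        "imdb_score": [3.8, 3.8],
        "genres": ["['comedy']", "['comedy']"],
    },
    index=[77798, 85576],
)

columns = ["title ", "release_year", "imdb_score", "genres"]
# Keep the first of each group of identical rows
filmography = toy.loc[toy["name"] == "Ines Prieto", columns].drop_duplicates()
print(filmography)
```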


# Task 4: Finding Highly Rated Movies
We want to identify movies with an IMDb rating above 9.0. This list could be helpful for curating a "Top Movies" section based on high ratings.
- First, filter the DataFrame to include only rows where imdb_score is greater than 9.0.
- From this filtered DataFrame, select only the title column.
- Convert the resulting titles to a set to remove any duplicates. Using set() keeps only unique titles.
- Display the final set of unique titles to see the list of top-rated movies.

In [11]:
# imdb_score is already float64 according to df.info() above, so this conversion
# is only a safeguard: errors='coerce' would turn any non-numeric value into NaN.
df['imdb_score'] = pd.to_numeric(df['imdb_score'], errors='coerce')

# Filter for titles with an IMDb score above 9.0.
# NaN > 9.0 evaluates to False, so rows without a score are excluded automatically.
score = df[df['imdb_score'] > 9.0]

# Extract the 'title ' column from the filtered DataFrame
titles = score['title ']

# Get a unique set of titles
unique_titles = set(titles)

# Print the unique titles
print(unique_titles)

{'Avatar: The Last Airbender', 'Kota Factory', 'Our Planet', 'Major', 'Reply 1988', 'My Mister', 'The Last Dance', 'Breaking Bad'}
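The filter above looks only at the score, so the result can contain shows as well as movies (Breaking Bad, for example, is a TV series). A hedged sketch of restricting the result to movies is shown below. The `toy` frame, its titles, and its scores are invented, and the cleaned column names `title` and `type` are assumptions; in this notebook the corresponding columns are still called `'title '` and `'  Type'`.

```python
import pandas as pd

# Hypothetical toy data; titles and scores are made up
toy = pd.DataFrame({
    "title": ["Show A", "Show A", "Movie B", "Movie C"],
    "type": ["SHOW", "SHOW", "MOVIE", "MOVIE"],
    "imdb_score": [9.5, 9.5, 9.1, 8.2],
})

# Combine both conditions with & and wrap each one in parentheses
is_movie = toy["type"] == "MOVIE"
is_top = toy["imdb_score"] > 9.0
top_movies = set(toy.loc[is_movie & is_top, "title"])
print(top_movies)
```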


# Task 5: Creating a Function to Find Unique Top-Rated Movies
In this task, we'll create a function to find unique movies with an IMDb score above a certain threshold that the user provides.
- Create a function called get_unique_top_movies that takes one parameter, min_score.
- Inside the function, create a new variable that stores rows where the imdb_score is greater than or equal to min_score.
- In a new variable, select the title column from the filtered DataFrame.
- Convert the variable to a set using set() to automatically remove duplicates, ensuring each title appears only once.
- Make sure the function returns the unique_titles set.


In [12]:
# imdb_votes is already float64 according to df.info(); as with imdb_score,
# pd.to_numeric with errors='coerce' is only a safeguard here.
df['imdb_votes'] = pd.to_numeric(df['imdb_votes'], errors='coerce')

# Define the function
def get_unique_top_movies(min_score):
    # Keep rows whose IMDb score is at least min_score
    high_score_df = df[df['imdb_score'] >= min_score]

    # Extract the 'title ' column from the filtered rows
    high_score_titles = high_score_df['title ']

    # Remove duplicate titles
    high_score_titles = set(high_score_titles)

    # Return unique titles
    return high_score_titles

In [13]:
# Test the function
print(get_unique_top_movies(9.0))

{'Okupas', 'Hunter x Hunter', 'Avatar: The Last Airbender', 'Leah Remini: Scientology and the Aftermath', 'Our Planet', 'Kota Factory', 'Attack on Titan', 'Major', 'Reply 1988', 'My Mister', 'DEATH NOTE', 'The Last Dance', 'Breaking Bad', 'Arcane'}
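This set is larger than the one printed in Task 4 because `>=` also admits titles scored exactly 9.0, while Task 4 used a strict `>`. The sketch below illustrates the difference on invented data. It also passes the DataFrame in as an argument instead of reading the global `df`, which is a design choice of this sketch rather than part of the task; the function name, the `title_col` parameter, and the `toy` values are all hypothetical.

```python
import pandas as pd

def unique_titles_at_or_above(frame, min_score, title_col="title "):
    """Return the set of titles whose imdb_score is >= min_score."""
    return set(frame.loc[frame["imdb_score"] >= min_score, title_col])

# Hypothetical toy data; titles and scores are made up
toy = pd.DataFrame({
    "title ": ["Title A", "Title B", "Title C"],
    "imdb_score": [9.0, 9.3, 8.9],
})

inclusive = unique_titles_at_or_above(toy, 9.0)          # >= 9.0
strict = set(toy.loc[toy["imdb_score"] > 9.0, "title "])  # > 9.0
print(inclusive, strict)
```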


# Task 6: Creating a Function to Find Top Movies from a Specific Decade
Let's create a function to retrieve movies from a particular decade with high IMDb ratings. This function can be useful for generating lists of top-rated movies from different time periods.
- Create a function called get_top_movies_from_decade that accepts decade_start and min_score as parameters.
- Inside the function, filter the DataFrame to include only movies where release_year is within the specified decade.
- Further filter this subset to include only movies where imdb_score is greater than or equal to min_score.
- From the resulting DataFrame, select the title column and remove duplicates using set().
- Return the set of the best movies from that decade.




In [14]:
# Define the function
def get_top_movies_from_decade(decade_start, min_score):
    # Filter for movies released within the decade
    decade = df[(df['release_year'] >= decade_start) & (df['release_year'] <= decade_start + 9)]

    # Further filter by IMDb score
    decade = decade[decade['imdb_score'] >= min_score]

    # Extract and remove duplicate titles
    titles = set(decade['title '])

    # Return unique titles
    return titles

In [15]:
# Test the function
print(get_top_movies_from_decade(1990, 8.5))

{'GoodFellas', 'One Piece', 'L??on: The Professional', 'Bill Hicks: Revelations', 'Forrest Gump', 'Se7en', 'Neon Genesis Evangelion', 'Cowboy Bebop'}
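The two comparisons on `release_year` can also be written with `Series.between`, which includes both endpoints by default. The following is a sketch under the same assumptions as the function above; the function name and the `toy` data are hypothetical.

```python
import pandas as pd

def top_titles_in_decade(frame, decade_start, min_score):
    """Titles released in [decade_start, decade_start + 9] with score >= min_score."""
    in_decade = frame["release_year"].between(decade_start, decade_start + 9)
    good_enough = frame["imdb_score"] >= min_score
    return set(frame.loc[in_decade & good_enough, "title "])

# Hypothetical toy data; titles, years, and scores are made up
toy = pd.DataFrame({
    "title ": ["Title A", "Title B", "Title C", "Title D"],
    "release_year": [1989, 1990, 1999, 2000],
    "imdb_score": [9.0, 8.7, 8.5, 9.1],
})

result = top_titles_in_decade(toy, 1990, 8.5)
print(result)
```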


# Task 7: Creating a Function to List All Actors in a Given Title

Imagine you want to list all the actors in a specific movie or show. Let's create a function that takes a title as input and returns the names of all actors in that title, combined into a single string.
- Create a function called get_actors_for_title that accepts one parameter, title.
- Inside the function, filter the DataFrame to select rows where the title column matches the provided title parameter and the role column is "ACTOR" (to ensure you only retrieve actors).
- From this filtered DataFrame, select only the `name` column to get the actor names.
- Use `', '.join()` to combine the list of actor names into a single string, with each name separated by a comma and a space.
- Return the resulting string of actor names.



In [16]:
# Define the function
def get_actors_for_title(title):
    # Filter for rows with the specified title and role as 'ACTOR'
    title_df = df[(df['title '] == title) & (df['role '] == 'ACTOR')]

    # Extract the 'name' column for actor names
    actor_names = title_df['name']

    # Combine names into a single string
    combined_names = ', '.join(actor_names.tolist())

    # Return the result
    return combined_names

In [17]:
# Test the function
print(get_actors_for_title("Taxi Driver"))

Robert De Niro, Jodie Foster, Albert Brooks, Harvey Keitel, Cybill Shepherd, Peter Boyle, Leonard Harris, Diahnne Abbott, Gino Ardito, Martin Scorsese, Murray Moston, Richard Higgs, Bill Minkin, Bob Maroff, Victor Argo, Joe Spinell, Robinson Frank Adu, Brenda Dickson, Norman Matlock, Harry Northup, Harlan Cary Poe, Steven Prince, Peter Savage, Nicholas Shields, Ralph S. Singleton, Annie Gagen, Carson Grant, Mary-Pat Green, Debbi Morgan, Don Stroud, Copper Cunningham, Garth Avery, Nat Grant, Billie Perkins, Catherine Scorsese, Charles Scorsese, Odunlade Adekola, Ijeoma Grace Agu
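A title string is not guaranteed to be unique in a dataset like this one: two different productions can share a name, in which case their casts are merged into one string. A hedged sketch of a variant that also filters by release year is shown below. The function name, the `year` parameter, and the `toy` data are hypothetical; the column names follow the ones used in this notebook.

```python
import pandas as pd

def actors_for_title_and_year(frame, title, year):
    """Join the names of all ACTOR rows for one title released in one year."""
    mask = (
        (frame["title "] == title)
        & (frame["role "] == "ACTOR")
        & (frame["release_year"] == year)
    )
    return ", ".join(frame.loc[mask, "name"])

# Hypothetical toy data: two productions that share a title
toy = pd.DataFrame({
    "name": ["Actor One", "Actor Two", "Actor Three", "Director Four"],
    "role ": ["ACTOR", "ACTOR", "ACTOR", "DIRECTOR"],
    "title ": ["Same Title", "Same Title", "Same Title", "Same Title"],
    "release_year": [1976, 1976, 2015, 1976],
})

cast_1976 = actors_for_title_and_year(toy, "Same Title", 1976)
print(cast_1976)
```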


# Task 8: Creating a Function to Categorize Movies by IMDb Score
Let's categorize movies and shows based on their IMDb scores to provide a quick evaluation of their popularity or quality. We'll create a function that takes in a title and returns a rating category based on the movie or show's IMDb score.

- Create a function called categorize_imdb_score that accepts one parameter, title.
- Inside the function, filter the DataFrame to find the row where title matches the given title.
- If the title exists, retrieve the imdb_score for that movie or show.
- If the title is not found, return "Title not found".
- Use if-elif-else statements to evaluate the imdb_score and return one of the rating categories based on the ranges provided.
- Return the appropriate category based on the score.

**Rating Categories**
- Excellent: IMDb score of 9.0 or higher
- Good: IMDb score between 7.0 and 8.9
- Average: IMDb score between 5.0 and 6.9
- Low: IMDb score below 5.0



In [18]:
# Define the function
def categorize_imdb_score(title):
    # Filter for the rows with the specified title
    title_df = df[df['title '] == title]

    # Check if title exists
    if title_df.empty:
        return "Title not found"

    # Categorize each matching row's score with if-elif-else.
    # Branches are checked from highest to lowest, so each lower bound is enough.
    results = []
    for score in title_df['imdb_score']:
        if score >= 9.0:
            results.append(f'Score: {score} - Excellent!!')
        elif score >= 7.0:
            results.append(f'Score: {score} - Good!')
        elif score >= 5.0:
            results.append(f'Score: {score} - Average.')
        else:
            # Scores below 5.0 (and missing scores, since NaN fails every comparison)
            results.append(f'Score: {score} - Low...')

    # Return one labelled entry per matching row
    return results

In [19]:
# Test the function
print(categorize_imdb_score("Taxi Driver"))

['Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 8.2 - Good!', 'Score: 6.0 - Average.', 'Score: 6.0 - Average.', 'Score: 6.0 - Average.']
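The function returns one entry per cast row, which is why the same label repeats many times in the output above. A possible variant, sketched here on invented data, reports each distinct score once and labels missing scores separately instead of letting them fall into the lowest category. The helper names, the "No score" label, and the `toy` frame are assumptions made for this illustration.

```python
import math
import pandas as pd

def score_category(score):
    """Map a single IMDb score to the rating categories listed above."""
    if score is None or math.isnan(score):
        return "No score"  # assumption: missing scores get their own label
    if score >= 9.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 5.0:
        return "Average"
    return "Low"

def categorize_title(frame, title):
    """Return one labelled entry per distinct score found for the title."""
    scores = frame.loc[frame["title "] == title, "imdb_score"].drop_duplicates()
    if scores.empty:
        return "Title not found"
    return [f"Score: {s} - {score_category(s)}" for s in scores]

# Hypothetical toy data mirroring the repeated rows in the output above
toy = pd.DataFrame({
    "title ": ["Taxi Driver", "Taxi Driver", "Taxi Driver"],
    "imdb_score": [8.2, 8.2, 6.0],
})

labels = categorize_title(toy, "Taxi Driver")
print(labels)
```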
