# Everyone's watching!

# Data Manipulation with Python

Welcome to the QCL Workshop **Data Manipulation with Python**.

This is a Level 2 Workshop, so I will assume you are familiar with the topics covered in the **Practical Programming with Python** Workshop (Level 1):

* Basics of Jupyter notebook
* Variables
* Functions
* Lists and dictionaries
* For loops
* Conditional statements

![](imgs/np_pd_logos.png)

By the end of this workshop you will be able to:

* Import and export data
* Explore and subset DataFrames
* Sort DataFrames and deal with missing data

## Today's data

Today's data was taken from [Kaggle](https://www.kaggle.com/). The dataset was scraped from [IMDb](https://www.imdb.com/), a website that contains information about movies and TV shows. Registered users' votes are summarized as each title's rating.

The goal for today will be to explore and clean IMDb's top [50,000 TV shows](https://www.kaggle.com/datasets/muralidharbhusal/50000-imdb-tv-and-web-series) dataset.

## Pandas and Numpy

Pandas is a powerful and flexible library used for data exploration and transformation. Pandas is built on top of NumPy and uses DataFrames as its main data structure. NumPy (Numerical Python) provides the tools for efficient numerical computation (e.g. matrix multiplication) and uses multidimensional arrays as its data structure.

|                         Pandas                        |                   NumPy                  |
|:-----------------------------------------------------:|:----------------------------------------:|
|          Uses a 2D data structure (DataFrame)         | Capable of using multidimensional arrays |
|               Slower compared with NumPy              |            Faster than Pandas            |
| Columns in a DataFrame can be of different data types | Arrays can only be of one data type      |

To use a library in Python, we first need to import it using an `import` statement. For example, we can import NumPy.

In [121]:
# Import NumPy
import numpy

print(numpy.pi)

3.141592653589793


In [122]:
# Import with an alias
import numpy as np

print(np.pi)

3.141592653589793


## Pandas DataFrames

A DataFrame is a 2D data structure where each column can be of a different type. Both rows and columns are labeled.

### Create a DataFrame

One way to construct a DataFrame is by using a dictionary.

In [2]:
import pandas as pd

# Create a dictionary
data_dict = {'Title': ['Wednesday', 'The White Lotus'], 
             'Release Year': [2022, 2021], 
             'Rating': [8.2, 7.9]}

# Create the DataFrame
df = pd.DataFrame(data_dict)
print(df)

             Title  Release Year  Rating
0        Wednesday          2022     8.2
1  The White Lotus          2021     7.9
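
The data type difference listed in the Pandas and NumPy comparison table can be checked directly. The sketch below is only illustrative: `arr` and `small_df` are hypothetical names that are not used elsewhere in the workshop.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type, so mixed values are coerced to strings
arr = np.array(['Wednesday', 2022, 8.2])
print(arr.dtype)

# A DataFrame keeps a separate data type for each column
small_df = pd.DataFrame({'Title': ['Wednesday', 'The White Lotus'],
                         'Release Year': [2022, 2021],
                         'Rating': [8.2, 7.9]})
print(small_df.dtypes)
```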


Creating DataFrames by hand is useful for testing purposes. However, in most cases we will need to read our data from a CSV (Comma Separated Values) file or any other file format.

### Importing data

Pandas has many functions to import data from different sources. For example, we can read CSV files using the `read_csv()` function.

In [3]:
# Read today's data
imdb = pd.read_csv("data/imdb_tv_series.csv")

<div class="alert alert-block alert-warning">
    <b>Note:</b> Pandas also has functions to read Excel files (<code>read_excel()</code>) and even SQL databases (<code>read_sql()</code>).
</div>
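
To try `read_csv()` without the workshop file, the sketch below reads CSV text held in memory. The contents of `csv_text` and the name `shows` are made up for illustration; `io.StringIO` simply makes the text behave like a file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Title,Release Year,End Year
Wednesday,2022,
The White Lotus,2021,2023
"""

# read_csv() accepts a file path or any file-like object
shows = pd.read_csv(io.StringIO(csv_text))

# The empty End Year field is read as NaN, which makes that column float64
print(shows)
print(shows.dtypes)
```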

### Exploring your DataFrame

The IMDb dataset is considerably larger than our first DataFrame, which is why printing the whole thing is not recommended.

In [125]:
# Print the IMDB DataFrame
print(imdb)

                                           Title  Release Year  End Year  \
0                                      Wednesday        2022.0       NaN   
1                                    Yellowstone        2018.0       NaN   
2                                The White Lotus        2021.0    2023.0   
3                                           1923        2022.0    2023.0   
4                                      Jack Ryan        2018.0       NaN   
...                                          ...           ...       ...   
49995          Law & Order: Special Victims Unit        1999.0       NaN   
49996                                 Doctor Who        2005.0       NaN   
49997  The Lord of the Rings: The Rings of Power        2022.0       NaN   
49998                                   The Bear        2022.0       NaN   
49999                               Supernatural        2005.0    2020.0   

      Rating                                               Cast Runtime  \
0        8.2

However, we can explore our dataset using some useful Pandas functions. The `head()` function will print the first 5 rows of our DataFrame by default.

In [126]:
# First few rows of our DataFrame
imdb.head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
0,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"
1,Yellowstone,2018.0,,8.7,"Kevin Costner, Luke Grimes, Kelly Reilly, Wes ...",60 min,"Drama, Western"
2,The White Lotus,2021.0,2023.0,7.9,"Jennifer Coolidge, Jon Gries, F. Murray Abraha...",60 min,"Comedy, Drama"
3,1923,2022.0,2023.0,8.6,"Harrison Ford, Helen Mirren, Brandon Sklenar, ...",60 min,"Drama, Western"
4,Jack Ryan,2018.0,,8.0,"John Krasinski, Wendell Pierce, Michael Kelly,...",60 min,"Action, Drama, Thriller"


We can also obtain the dimensions of our DataFrame using `shape`.

In [127]:
# Dimensions of our DataFrame
imdb.shape

(50000, 7)

<div class="alert alert-block alert-warning">
    <b>Note:</b> While <code>head()</code> is a method (i.e. a function associated with a Pandas DataFrame), <code>shape</code> is an attribute. This is why we do not use parentheses.
</div>

We can also access the full list of column names of our DataFrame using `columns`.

In [128]:
# Columns of our DataFrame
imdb.columns

Index(['Title', 'Release Year', 'End Year', 'Rating', 'Cast', 'Runtime',
       'Genre'],
      dtype='object')

To generate descriptive statistics of our DataFrame, we can use the `describe()` function.

In [129]:
# Descriptive statistics of our DataFrame
imdb.describe()

Unnamed: 0,Release Year,End Year
count,49791.0,17410.0
mean,2014.403587,2012.677771
std,12.131611,15.465337
min,1930.0,1953.0
25%,2011.0,2008.0
50%,2020.0,2020.0
75%,2022.0,2022.0
max,2024.0,2023.0


Notice that `describe()`, by default, only summarizes numeric columns, which in our case are Release Year and End Year.
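
To summarize text columns as well, `describe()` accepts an `include` argument. The sketch below uses a small hypothetical DataFrame (`small_df`) rather than the workshop data.

```python
import pandas as pd

small_df = pd.DataFrame({'Title': ['Wednesday', 'Yellowstone', 'Wednesday'],
                         'Release Year': [2022, 2018, 2022]})

# Default: only the numeric column is summarized
print(small_df.describe())

# include='all' adds count, unique, top and freq for the text column
all_summary = small_df.describe(include='all')
print(all_summary)
```

Statistics that do not apply to a column (for example, the mean of Title) are shown as NaN.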

The `info()` function prints a summary of the DataFrame including the columns' data types (Dtype) and a Non-Null count.

In [130]:
# Summary of our DataFrame
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         50000 non-null  object 
 1   Release Year  49791 non-null  float64
 2   End Year      17410 non-null  float64
 3   Rating        50000 non-null  object 
 4   Cast          49909 non-null  object 
 5   Runtime       50000 non-null  object 
 6   Genre         50000 non-null  object 
dtypes: float64(2), object(5)
memory usage: 2.7+ MB


We can see from this output that there are some missing values in our Release Year, End Year and Cast columns. We will deal with this later.

Some of the Pandas data types you may encounter are:

| Pandas dtype |                     Usage                    |
|:------------:|:--------------------------------------------:|
|    object    | Text or mixed numeric and non-numeric values |
|     int64    |                Integer numbers               |
|    float64   |            Floating point numbers            |
|     bool     |               True/False values              |

These functions give us general information about our data. However, we can also explore subsets of our data.

### Subsetting your DataFrame

We can select columns of the DataFrame using the column name.

In [131]:
# Select the titles
imdb['Title']

0                                        Wednesday
1                                      Yellowstone
2                                  The White Lotus
3                                             1923
4                                        Jack Ryan
                           ...                    
49995            Law & Order: Special Victims Unit
49996                                   Doctor Who
49997    The Lord of the Rings: The Rings of Power
49998                                     The Bear
49999                                 Supernatural
Name: Title, Length: 50000, dtype: object

<div class="alert alert-block alert-warning">
<b>Note:</b> Each column in a DataFrame is a Pandas Series. We can think of a DataFrame as a dictionary of Pandas Series.
</div>
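
The relationship between a DataFrame and its Series can be checked with `type()`. This is a small sketch on a hypothetical DataFrame (`small_df`), not the workshop data.

```python
import pandas as pd

small_df = pd.DataFrame({'Title': ['Wednesday', 'Yellowstone'],
                         'Rating': [8.2, 8.7]})

# Single brackets with one column name return a Series
title_series = small_df['Title']
print(type(title_series))

# Double brackets (a list of names) return a DataFrame, even with one column
title_frame = small_df[['Title']]
print(type(title_frame))
```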

If we do not want all the rows we can use the `head()` function at the end. We can also select multiple columns by using a list of column names.

In [132]:
# Select multiple columns
imdb[['Title', 'Release Year', 'Rating']].head()

Unnamed: 0,Title,Release Year,Rating
0,Wednesday,2022.0,8.2
1,Yellowstone,2018.0,8.7
2,The White Lotus,2021.0,7.9
3,1923,2022.0,8.6
4,Jack Ryan,2018.0,8.0


Sometimes, it is useful to store the list of columns we want to access in a separate variable.

In [133]:
# Select multiple columns
cols = ['Title', 'Release Year', 'Rating']
imdb[cols].head()

Unnamed: 0,Title,Release Year,Rating
0,Wednesday,2022.0,8.2
1,Yellowstone,2018.0,8.7
2,The White Lotus,2021.0,7.9
3,1923,2022.0,8.6
4,Jack Ryan,2018.0,8.0


We can also select rows of our DataFrame based on a condition. Using comparison operators on a column returns a column of the same length with boolean values.

In [134]:
# Condition on Release Year
imdb['Release Year'] > 2010

0         True
1         True
2         True
3         True
4         True
         ...  
49995    False
49996    False
49997     True
49998     True
49999    False
Name: Release Year, Length: 50000, dtype: bool

We can use this to select only the rows of those TV shows that were released after 2010.

In [135]:
# TV shows released after 2010
imdb[imdb['Release Year'] > 2010].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
0,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"
1,Yellowstone,2018.0,,8.7,"Kevin Costner, Luke Grimes, Kelly Reilly, Wes ...",60 min,"Drama, Western"
2,The White Lotus,2021.0,2023.0,7.9,"Jennifer Coolidge, Jon Gries, F. Murray Abraha...",60 min,"Comedy, Drama"
3,1923,2022.0,2023.0,8.6,"Harrison Ford, Helen Mirren, Brandon Sklenar, ...",60 min,"Drama, Western"
4,Jack Ryan,2018.0,,8.0,"John Krasinski, Wendell Pierce, Michael Kelly,...",60 min,"Action, Drama, Thriller"


If instead we wanted to retrieve the TV shows that were released between two years, we can use multiple conditions.

In [136]:
# Use multiple conditions
after_2010 = imdb['Release Year'] > 2010
before_2015 = imdb['Release Year'] < 2015
imdb[after_2010 & before_2015].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
12,Game of Thrones,2011.0,2019.0,9.2,"Emilia Clarke, Peter Dinklage, Kit Harington, ...",57 min,"Action, Adventure, Drama"
20,Happy Valley,2014.0,2023.0,8.5,"Sarah Lancashire, Siobhan Finneran, Charlie Mu...",58 min,"Crime, Drama, Thriller"
43,Peaky Blinders,2013.0,2022.0,8.8,"Cillian Murphy, Paul Anderson, Sophie Rundle, ...",60 min,"Crime, Drama"
57,Rick and Morty,2013.0,,9.1,"Justin Roiland, Chris Parnell, Spencer Grammer...",23 min,"Animation, Adventure, Comedy"
61,American Horror Story,2011.0,,8.0,"Lady Gaga, Kathy Bates, Angela Bassett, Sarah ...",60 min,"Drama, Horror, Sci-Fi"


In [137]:
# Alternative syntax
imdb[(imdb['Release Year'] > 2010) & (imdb['Release Year'] < 2015)].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
12,Game of Thrones,2011.0,2019.0,9.2,"Emilia Clarke, Peter Dinklage, Kit Harington, ...",57 min,"Action, Adventure, Drama"
20,Happy Valley,2014.0,2023.0,8.5,"Sarah Lancashire, Siobhan Finneran, Charlie Mu...",58 min,"Crime, Drama, Thriller"
43,Peaky Blinders,2013.0,2022.0,8.8,"Cillian Murphy, Paul Anderson, Sophie Rundle, ...",60 min,"Crime, Drama"
57,Rick and Morty,2013.0,,9.1,"Justin Roiland, Chris Parnell, Spencer Grammer...",23 min,"Animation, Adventure, Comedy"
61,American Horror Story,2011.0,,8.0,"Lady Gaga, Kathy Bates, Angela Bassett, Sarah ...",60 min,"Drama, Horror, Sci-Fi"


The logical operators in Pandas are:

| Python | Pandas |
|:------:|:------:|
|   and  |    &   |
|   or   |   \|   |
|   not  |    ~   |
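
The three operators in the table can be tried on a small hypothetical DataFrame (`small_df`, made up for this sketch). Note the parentheses around each comparison, as in the alternative syntax shown earlier.

```python
import pandas as pd

small_df = pd.DataFrame({'Title': ['Game of Thrones', 'Dark', 'Doctor Who'],
                         'Release Year': [2011, 2017, 2005]})

# Wrap each comparison in parentheses: & and | bind more tightly than > or <
between = small_df[(small_df['Release Year'] > 2010) & (small_df['Release Year'] < 2015)]

# | keeps rows that satisfy either condition
either = small_df[(small_df['Release Year'] < 2010) | (small_df['Release Year'] > 2015)]

# ~ inverts a boolean Series
not_after_2010 = small_df[~(small_df['Release Year'] > 2010)]

print(between['Title'].tolist())         # ['Game of Thrones']
print(either['Title'].tolist())          # ['Dark', 'Doctor Who']
print(not_after_2010['Title'].tolist())  # ['Doctor Who']
```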

To select rows based on text data, we can use the `==` operator.

In [138]:
# Select the row for Dark
imdb[imdb['Title'] == 'Dark']

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
79,Dark,2017.0,2020.0,8.7,"Louis Hofmann, Karoline Eichhorn, Lisa Vicari,...",60 min,"Crime, Drama, Mystery"


Instead of combining several `==` conditions to check a column against multiple values, we can use the `isin()` function.

In [139]:
# Select multiple titles with conditions
imdb[(imdb['Title'] == 'Dark') | (imdb['Title'] == 'The Big Bang Theory')]

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
54,The Big Bang Theory,2007.0,2019.0,8.2,"Johnny Galecki, Jim Parsons, Kaley Cuoco, Simo...",22 min,"Comedy, Romance"
79,Dark,2017.0,2020.0,8.7,"Louis Hofmann, Karoline Eichhorn, Lisa Vicari,...",60 min,"Crime, Drama, Mystery"


In [140]:
# Select multiple titles with isin()
imdb[imdb['Title'].isin(['Dark', 'The Big Bang Theory'])]

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
54,The Big Bang Theory,2007.0,2019.0,8.2,"Johnny Galecki, Jim Parsons, Kaley Cuoco, Simo...",22 min,"Comedy, Romance"
79,Dark,2017.0,2020.0,8.7,"Louis Hofmann, Karoline Eichhorn, Lisa Vicari,...",60 min,"Crime, Drama, Mystery"


This also works for any other data type, not just text.

In [141]:
# Select all TV shows with release year 2010, 2015 and 2020
imdb[imdb['Release Year'].isin([2010, 2015, 2020])].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
6,Alice in Borderland,2020.0,,7.7,"Kento Yamazaki, Tao Tsuchiya, Nijirô Murakami,...",50 min,"Action, Drama, Mystery"
11,The Walking Dead,2010.0,2022.0,8.1,"Andrew Lincoln, Norman Reedus, Melissa McBride...",44 min,"Drama, Horror, Thriller"
13,Emily in Paris,2020.0,,6.9,"Lily Collins, Philippine Leroy-Beaulieu, Ashle...",30 min,"Comedy, Drama, Romance"
27,Better Call Saul,2015.0,2022.0,8.9,"Bob Odenkirk, Rhea Seehorn, Jonathan Banks, Pa...",46 min,"Crime, Drama"
35,Hunters,2020.0,2023.0,7.2,"Al Pacino, Logan Lerman, Lena Olin, Jerrika Hi...",60 min,"Crime, Drama, Mystery"


You may not have noticed, but there are many duplicated titles in our dataset.

In [8]:
# Select a different show
imdb[imdb['Title'] == 'Alice in Borderland'].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
6,Alice in Borderland,2020.0,,7.7,"Kento Yamazaki, Tao Tsuchiya, Nijirô Murakami,...",50 min,"Action, Drama, Mystery"
10006,Alice in Borderland,2020.0,,7.7,"Kento Yamazaki, Tao Tsuchiya, Nijirô Murakami,...",50 min,"Action, Drama, Mystery"
10056,Alice in Borderland,2020.0,,7.7,"Kento Yamazaki, Tao Tsuchiya, Nijirô Murakami,...",50 min,"Action, Drama, Mystery"
10106,Alice in Borderland,2020.0,,7.7,"Kento Yamazaki, Tao Tsuchiya, Nijirô Murakami,...",50 min,"Action, Drama, Mystery"
10156,Alice in Borderland,2020.0,,7.7,"Kento Yamazaki, Tao Tsuchiya, Nijirô Murakami,...",50 min,"Action, Drama, Mystery"


To check for duplicated rows we can use the `duplicated()` function.

In [12]:
# See some duplicated rows
dups = imdb[imdb.duplicated()]
dups[dups['Title'] == "Wednesday"].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
10000,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"
10050,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"
10100,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"
10150,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"
10200,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy"


To drop these duplicated rows, we can use the `drop_duplicates()` function.
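
A minimal sketch of removing duplicated rows with `drop_duplicates()`, shown on a small hypothetical DataFrame (`small_df`) rather than on the workshop data.

```python
import pandas as pd

small_df = pd.DataFrame({'Title': ['Wednesday', 'Dark', 'Wednesday'],
                         'Rating': [8.2, 8.7, 8.2]})

# duplicated() flags every repeat of an earlier row
print(small_df.duplicated().sum())

# drop_duplicates() returns a new DataFrame that keeps the first occurrence
deduped = small_df.drop_duplicates()
print(deduped)
```

The original DataFrame is left unchanged, so the result has to be assigned to a variable to be kept.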

#### Location-based indexing

Pandas has two additional ways of indexing that allow us to select specified rows and columns: `.loc` and `.iloc`. 

The `.loc` indexer is used to obtain rows and columns with a specific label. Since we have not modified the row labels, these correspond to their integer position.

In [142]:
# Select the title in the row labeled 11
imdb.loc[11, 'Title']

'The Walking Dead'

However, the most useful way to use `.loc` is by subsetting rows based on a condition and selecting columns by name.

In [143]:
# Select the TV show titles released in 2010, 2015 and 2020
imdb.loc[imdb['Release Year'].isin([2010, 2015, 2020]), 'Title'].head()

6     Alice in Borderland
11       The Walking Dead
13         Emily in Paris
27       Better Call Saul
35                Hunters
Name: Title, dtype: object

We can use this to obtain more than one column too.

In [144]:
# Select the TV show titles and rating
imdb.loc[imdb['Release Year'].isin([2010, 2015, 2020]), ['Title', 'Release Year', 'Rating']].head()

Unnamed: 0,Title,Release Year,Rating
6,Alice in Borderland,2020.0,7.7
11,The Walking Dead,2010.0,8.1
13,Emily in Paris,2020.0,6.9
27,Better Call Saul,2015.0,8.9
35,Hunters,2020.0,7.2


On the other hand, the `.iloc` indexer is mainly integer position based. For example, we can take the first row and the first to third columns.

In [145]:
# First row and first to third columns
imdb.iloc[0, 0:3]

Title           Wednesday
Release Year       2022.0
End Year              NaN
Name: 0, dtype: object

We can also obtain non-consecutive rows and columns by using lists.

In [146]:
# First and fifth rows with the first and fourth columns
imdb.iloc[[0, 4], [0, 3]]

Unnamed: 0,Title,Rating
0,Wednesday,8.2
4,Jack Ryan,8.0
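
Labels and positions stop matching as soon as rows are filtered. The sketch below illustrates this on a small hypothetical DataFrame (`small_df`); the names `low_rated`, `by_label` and `by_position` are made up for the example.

```python
import pandas as pd

small_df = pd.DataFrame({'Title': ['Wednesday', 'Yellowstone', 'The White Lotus'],
                         'Rating': [8.2, 8.7, 7.9]})

# Filtering keeps the original row labels (here 0 and 2)
low_rated = small_df[small_df['Rating'] < 8.5]

# .loc uses the label 2, .iloc uses the position 1: both point to the same row
by_label = low_rated.loc[2, 'Title']
by_position = low_rated.iloc[1, 0]
print(by_label, '|', by_position)

# Slices differ too: .loc includes the end label, .iloc excludes the end position
print(len(small_df.loc[0:1]), len(small_df.iloc[0:1]))
```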


### Sorting your DataFrame

Sorting our data can help us answer questions like "What is the oldest TV show in our dataset?". To sort the values of our DataFrame we use the `sort_values()` function.

In [147]:
# Sort by release year
imdb.sort_values("Release Year").head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
4006,Looney Tunes,1930.0,2014.0,8.7,"June Foray, Mel Blanc",7 min,"Animation, Adventure, Comedy"
7333,The Woody Woodpecker Show,1940.0,1972.0,7.0,"Walter Lantz, Grace Stafford, Daws Butler, Joh...",6 min,"Animation, Short, Comedy"
1563,Tom and Jerry,1940.0,1968.0,9.0,"Mel Blanc, William Hanna, June Foray, Harry Lang",8 min,"Animation, Short, Adventure"
7519,Puppet Playhouse,1947.0,1960.0,7.7,"Bob Smith, Robert Keeshan, Dayton Allen, Bill ...",30 min,Family
4988,Toast of the Town,1948.0,1971.0,7.9,"Ed Sullivan, Johnny Wayne, Frank Shuster, Ralp...",60 min,"Music, Talk-Show"


The `sort_values()` function by default sorts our values in ascending order. But we can specify to sort them in descending order as well.

In [148]:
# Sort by release year in descending order
imdb.sort_values("Release Year", ascending = False).head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
8551,The Lion Guard Drama King,2024.0,,****,"Dallas Messenger, Alex Rush, Teyani Ferguson, ...",****,"Animation, Action, Adventure"
39868,Anne Rice's Mayfair Witches,2023.0,,6.8,"Alexandra Daddario, Jack Huston, Tongayi Chiri...",****,"Fantasy, Horror"
35716,Will Trent,2023.0,,7.7,"Deion Smith, Ramón Rodríguez, Erika Christense...",****,"Crime, Drama"
5737,Jigokuraku,2023.0,,****,"Rie Takahashi, Kenshô Ono, Chiaki Kobayashi, T...",24 min,"Animation, Action, Adventure"
10466,Will Trent,2023.0,,7.7,"Deion Smith, Ramón Rodríguez, Erika Christense...",****,"Crime, Drama"


We can sort by multiple values by passing a list. For example, we can sort by Release Year and Rating.

In [149]:
# Sort by release year and rating in descending order
imdb.sort_values(["Release Year", "Rating"], ascending = False).head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre
8551,The Lion Guard Drama King,2024.0,,****,"Dallas Messenger, Alex Rush, Teyani Ferguson, ...",****,"Animation, Action, Adventure"
2210,Crazy Fun Park,2023.0,,9.3,"Henry Strand, Stacy Clausen, Hannah Ogawa, Ped...",****,Adventure
395,Taaza Khabar,2023.0,,8.6,"Bhuvan Bam, Shriya Pilgaonkar, Shilpa Shukla, ...",****,"Action, Comedy, Drama"
1492,Legacy,2023.0,,8.3,"Matthijs van de Sande Bakhuyzen, Sallie Harmse...",****,Drama
7003,Piiritys,2023.0,,8.3,"Anna Böhm, Mikko Kauppila, Konsta Laakso, Eero...",42 min,Thriller
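
The `ascending` argument also accepts a list with one value per column, which allows a different direction for each. The sketch below uses a small hypothetical DataFrame (`small_df`) whose Rating column is already numeric; in the workshop data Rating is still stored as text at this point, so it is compared as strings.

```python
import pandas as pd

small_df = pd.DataFrame({'Title': ['Legacy', 'Dark', 'Will Trent', 'Wednesday'],
                         'Release Year': [2023, 2017, 2023, 2022],
                         'Rating': [8.3, 8.7, 7.7, 8.2]})

# Newest release year first; within the same year, lowest rating first
ordered = small_df.sort_values(['Release Year', 'Rating'], ascending=[False, True])
print(ordered)
```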


Note that some of our Rating and Runtime values, encoded as "\*\*\*\*", are also missing. Missing data affects our analysis in several ways. For example, if we wanted to change the Rating column to a float type, Pandas would raise an error.

In [150]:
# Convert Rating to float
imdb['Rating'].astype('float')

ValueError: could not convert string to float: '****'
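
One way to avoid this error, different from the approach this workshop follows, is `pd.to_numeric()` with `errors='coerce'`, which turns values that cannot be parsed into NaN. The sketch below uses a hypothetical Series (`ratings`) instead of the workshop data.

```python
import pandas as pd

ratings = pd.Series(['8.2', '****', '7.9'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
numeric_ratings = pd.to_numeric(ratings, errors='coerce')
print(numeric_ratings)

# Count how many values could not be converted
print(numeric_ratings.isna().sum())
```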

## Dealing with missing data

Many libraries (including scikit-learn) raise errors when given missing values, even if these are stored as NaN (Not a Number). Moreover, missing values can drastically affect the quality of your models.

The first step when dealing with missing data is finding it. The `info()` function we introduced earlier is useful to detect missing values encoded as NaN. Unfortunately, as we saw, not all missing values are encoded the same way. 

Sometimes the data types of our columns may provide some information. For example, the Rating column has type object, which is used to represent text, even though we would expect ratings to be numeric.

In [151]:
# Show data types for all columns
imdb.dtypes

Title            object
Release Year    float64
End Year        float64
Rating           object
Cast             object
Runtime          object
Genre            object
dtype: object

<div class="alert alert-block alert-warning">
    <b>Note:</b> Release Year and End Year are type float64 even though years are integers. In Pandas, NaN is considered a float which forces a column of integers with missing values to become floats.
</div>
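
The effect described in the note can be reproduced with a few hypothetical Series. The nullable `'Int64'` dtype shown last is an option Pandas offers for keeping integers alongside missing values; it is not used in this workshop.

```python
import pandas as pd

# Without missing values, whole years are stored as integers
complete = pd.Series([2022, 2018])
print(complete.dtype)

# A single missing value turns the whole Series into floats
with_missing = pd.Series([2022, None])
print(with_missing.dtype)

# The nullable 'Int64' dtype (capital I) keeps integers and marks the gap as <NA>
nullable = pd.Series([2022, None], dtype='Int64')
print(nullable)
```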

To address missing values, you can either remove the columns and/or rows with missing data (not always recommended) or replace (impute) these values. Multiple imputation methods exist, and choosing the right one is not always easy. For demonstration purposes, we will replace the missing values in the Rating column with the mean of the existing values.

In [152]:
# Extract existing values
true_rating = imdb.loc[imdb['Rating'] != "****", "Rating"].astype("float")
true_rating.head()

0    8.2
1    8.7
2    7.9
3    8.6
4    8.0
Name: Rating, dtype: float64

In [153]:
# Compute the mean
mean_rating = np.mean(true_rating).round(1)
mean_rating

7.8

Let's create a new column with the replaced values of the Rating column.

In [154]:
# Replace "****" with the mean
new_ratings = imdb['Rating'].replace(to_replace="****", value=mean_rating)

# Create the new column with the filled Rating
imdb['Rating2'] = new_ratings.astype("float")

# Take a look at the changes
imdb[imdb['Rating'] == "****"].head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre,Rating2
9,The Last of Us,2023.0,,****,"Pedro Pascal, Bella Ramsey, Gabriel Luna, Merl...",****,"Action, Adventure, Drama",7.8
74,That '90s Show,2023.0,,****,"Kurtwood Smith, Debra Jo Rupp, Callie Haverda,...",30 min,"Comedy, Drama, Romance",7.8
136,Poker Face,2023.0,,****,"Natasha Lyonne, Megan Suri, Colton Ryan, Brand...",****,Mystery,7.8
183,The Consultant,2023.0,,****,"Christoph Waltz, Nat Wolff, Brittany O'Grady, ...",****,Comedy,7.8
224,Night Court,2023.0,,****,"Melissa Rauch, John Larroquette, India de Beau...",****,Comedy,7.8
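
When missing values are already stored as NaN, `fillna()` is an alternative to `replace()`. The sketch below assumes a hypothetical numeric Series (`ratings`) in which the placeholder has already been converted to NaN.

```python
import numpy as np
import pandas as pd

# Ratings where the missing value is already stored as NaN
ratings = pd.Series([8.0, np.nan, 7.0])

# mean() skips NaN by default, so only the existing values are averaged
mean_rating = ratings.mean()

# fillna() returns a new Series with NaN replaced by the given value
filled = ratings.fillna(mean_rating)
print(filled.tolist())  # [8.0, 7.5, 7.0]
```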


Before we fill the missing values in the Runtime column, we need to remove everything that is not a number from its values. It seems that the runtime for all TV shows is in minutes, but we need to confirm this. We will start by extracting the existing values as we did before.

In [155]:
# Extract existing values
true_runtime = imdb.loc[imdb['Runtime'] != "****", "Runtime"]
true_runtime.head()

0    45 min
1    60 min
2    60 min
3    60 min
4    60 min
Name: Runtime, dtype: object

To check if all existing values match the `<number> min` format we will use a regular expression (regex) with the string function `match()`.

In [156]:
# Check for matches
format_match = true_runtime.str.match(r'\d+ min')
format_match.head()

0    True
1    True
2    True
3    True
4    True
Name: Runtime, dtype: bool

This regular expression matches a digit from 0 to 9 (`\d`) one or more times (`+`), followed by a space and the characters "min". We will not cover regular expressions in this workshop, but the Resources section lists material to learn more.

To check if absolutely all values match the format, we will use the `all()` function.

In [157]:
# Are all runtime values in minutes?
format_match.all()

False

Now we know we may have missed something. Let's look at the values that do not match the format.

In [158]:
# Get values without a match
true_runtime[~format_match]

4371    1,290 min
Name: Runtime, dtype: object
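
The matching behavior seen above can be reproduced on a few hypothetical values. One detail worth knowing is that `str.match()` only checks the beginning of each string, while `str.fullmatch()` checks the whole string. The Series `runtimes` and the pattern that allows commas are illustrative assumptions, not part of the workshop code.

```python
import pandas as pd

runtimes = pd.Series(['45 min', '1,290 min', '45 minutes'])

# match() only requires the pattern to match at the start of the string
starts_with = runtimes.str.match(r'\d+ min')
print(starts_with.tolist())   # [True, False, True]

# fullmatch() requires the whole string to match the pattern
exact = runtimes.str.fullmatch(r'\d+ min')
print(exact.tolist())         # [True, False, False]

# Allowing commas among the digits also accepts values such as "1,290 min"
with_commas = runtimes.str.fullmatch(r'[\d,]+ min')
print(with_commas.tolist())   # [True, True, False]
```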

Fortunately, we only missed a comma in our pattern. To compute the mean, we will need to remove both the comma and "min".

In [159]:
# Replace multiple values
replace_dict = {",": "", " min": ""}
true_runtime_clean = true_runtime.replace(to_replace=replace_dict, regex=True).astype("int")
true_runtime_clean.head()

0    45
1    60
2    60
3    60
4    60
Name: Runtime, dtype: int64

With this, we can now compute the mean of our runtime values and fill in the missing data. We removed the comma and "min" from the existing runtimes to compute the mean, but we also need to remove them from the full Runtime column.

In [160]:
# Compute the mean runtime
mean_runtime = np.mean(true_runtime_clean).round()

# Replace missing values with the mean
replace_dict = {",": "", " min": "", r"\*\*\*\*": mean_runtime}
new_runtimes = imdb["Runtime"].replace(to_replace=replace_dict, regex=True)
imdb['Runtime2'] = new_runtimes.astype("int")
imdb.head()

Unnamed: 0,Title,Release Year,End Year,Rating,Cast,Runtime,Genre,Rating2,Runtime2
0,Wednesday,2022.0,,8.2,"Jenna Ortega, Hunter Doohan, Percy Hynes White...",45 min,"Comedy, Crime, Fantasy",8.2,45
1,Yellowstone,2018.0,,8.7,"Kevin Costner, Luke Grimes, Kelly Reilly, Wes ...",60 min,"Drama, Western",8.7,60
2,The White Lotus,2021.0,2023.0,7.9,"Jennifer Coolidge, Jon Gries, F. Murray Abraha...",60 min,"Comedy, Drama",7.9,60
3,1923,2022.0,2023.0,8.6,"Harrison Ford, Helen Mirren, Brandon Sklenar, ...",60 min,"Drama, Western",8.6,60
4,Jack Ryan,2018.0,,8.0,"John Krasinski, Wendell Pierce, Michael Kelly,...",60 min,"Action, Drama, Thriller",8.0,60


Additionally, to select TV shows that have both a Release Year and an End Year, we can subset our data using the `isna()` function and the `~` (not) operator.

In [6]:
# Select finished shows
finished_shows = imdb[(~imdb['Release Year'].isna()) & (~imdb['End Year'].isna())]
finished_shows.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17410 entries, 2 to 49999
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Title         17410 non-null  object 
 1   Release Year  17410 non-null  float64
 2   End Year      17410 non-null  float64
 3   Rating        17410 non-null  object 
 4   Cast          17403 non-null  object 
 5   Runtime       17410 non-null  object 
 6   Genre         17410 non-null  object 
dtypes: float64(2), object(5)
memory usage: 1.1+ MB
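
The same kind of selection can be written in two other ways, sketched below on a small hypothetical DataFrame (`small_df`): `notna()`, which is the opposite of `isna()`, and `dropna()` with its `subset` argument.

```python
import numpy as np
import pandas as pd

small_df = pd.DataFrame({'Title': ['Wednesday', 'Dark'],
                         'Release Year': [2022.0, 2017.0],
                         'End Year': [np.nan, 2020.0]})

# notna() is the opposite of isna(), so no ~ is needed
with_notna = small_df[small_df['Release Year'].notna() & small_df['End Year'].notna()]

# dropna() with subset removes rows that have NaN in any of the listed columns
with_dropna = small_df.dropna(subset=['Release Year', 'End Year'])

# Both approaches keep the same rows
print(with_notna.equals(with_dropna))
```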


## Writing files

Finally, writing our results to a file may be necessary to share our work or continue the analysis elsewhere. With Pandas we can write DataFrames to CSV, Excel and other file formats.

In [14]:
# Write our DataFrame to CSV
finished_shows.to_csv("finished_shows.csv", index=False)
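
A quick way to check a written file is to read it back. The sketch below writes a small hypothetical DataFrame (`small_df`) to a temporary folder; the file name `shows.csv` is made up for the example.

```python
import os
import tempfile
import pandas as pd

small_df = pd.DataFrame({'Title': ['Dark', 'Wednesday'], 'Rating': [8.7, 8.2]})

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'shows.csv')

    # index=False leaves the row labels out of the file
    small_df.to_csv(path, index=False)

    # Reading the file back gives the same columns
    restored = pd.read_csv(path)

print(restored.columns.tolist())  # ['Title', 'Rating']
```

Without `index=False`, the row labels are written as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read again.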

## Download your notebook

![](https://raw.githubusercontent.com/CMC-QCL/python-data-manipulation/main/imgs/jhub_download.png)

## Digital Badge

Send your notebook with the solved hands-on activities to **qcl@cmc.edu**

## Resources

More about Pandas
* Pandas documentation (https://pandas.pydata.org/docs/user_guide/10min.html)
* Expand your skills (https://www.kaggle.com/learn/pandas)
* More on Pandas (https://realpython.com/pandas-python-explore-dataset/)

Imputation methods:
* Flexible Imputation of Missing Data by Stef van Buuren (https://stefvanbuuren.name/fimd/) - Code examples in R
* Scikit-learn documentation (https://scikit-learn.org/stable/modules/impute.html)

Learn about regular expressions:
* Regular expressions in Python (https://realpython.com/regex-python/)
* Build and test regular expressions (https://regex101.com/)
* Regex cheat sheet (https://www.rexegg.com/regex-quickstart.html)

Finally, tools change:
* Pandas vs. Polars (https://studioterabyte.nl/en/blog/polars-vs-pandas)
* Polars tutorial (https://www.codemag.com/Article/2212051/Using-the-Polars-DataFrame-Library)