# 2.2 Data Exploration with Pandas

Now that we have covered some basic NumPy and pandas concepts, it's time to load some data and explore it to get more familiar with pandas DataFrames.

## Load Some Movie Reviews

Provided with this data are some movie reviews extracted from https://www.imdb.com/interfaces/. See the extras folder for how these were downloaded and modified.

In [56]:
import pandas as pd
import numpy as np

In [23]:
df = pd.read_csv("data/imdb.csv")

# Show the first few rows of data with df.head()
df.head()

Unnamed: 0,id,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0.0,1894,\N,1,"Documentary,Short",5.6,1643
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0.0,1892,\N,5,"Animation,Short",6.1,198
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0.0,1892,\N,4,"Animation,Comedy,Romance",6.5,1336
3,tt0000004,short,Un bon bock,Un bon bock,0.0,1892,\N,12,"Animation,Short",6.2,120
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0.0,1893,\N,1,"Comedy,Short",6.1,2119


Let's look at the documentation for the columns provided by IMDb:
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) – original title, in the original language
- isAdult (boolean) – 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title
- averageRating – weighted average of all the individual user ratings
- numVotes – number of votes the title has received

If we call df.info() we can get an idea of our datatypes and other info:


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 761138 entries, 0 to 761137
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              761138 non-null  object 
 1   titleType       761138 non-null  object 
 2   primaryTitle    761138 non-null  object 
 3   originalTitle   761138 non-null  object 
 4   isAdult         761138 non-null  float64
 5   startYear       761138 non-null  object 
 6   endYear         761138 non-null  object 
 7   runtimeMinutes  761138 non-null  object 
 8   genres          761137 non-null  object 
 9   averageRating   761138 non-null  float64
 10  numVotes        761138 non-null  int64  
dtypes: float64(2), int64(1), object(8)
memory usage: 63.9+ MB


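From this we can see that some columns aren't in the format we'd want. runtimeMinutes, startYear and endYear hold numbers but are currently `object` columns (the dtype pandas uses for strings), most likely because missing values are written as `\N`. isAdult is a float64 column when it could be a boolean.

Let's fix these: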

In [25]:
# Convert isAdult by simply changing its type.
df['isAdult'] = df['isAdult'].astype(bool)

In [27]:
# We can convert string columns to numeric with pd.to_numeric()
df['startYear'] = pd.to_numeric(df['startYear'], errors='coerce') # turn non-numeric records into np.nan
df['endYear'] = pd.to_numeric(df['endYear'], errors='coerce') # turn non-numeric records into np.nan
df['runtimeMinutes'] = pd.to_numeric(df['runtimeMinutes'], errors='coerce') # turn non-numeric records into np.nan

# Check our info again
df.info()

df.head(30)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 761138 entries, 0 to 761137
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              761138 non-null  object 
 1   titleType       761138 non-null  object 
 2   primaryTitle    761138 non-null  object 
 3   originalTitle   761138 non-null  object 
 4   isAdult         761138 non-null  bool   
 5   startYear       761071 non-null  float64
 6   endYear         23813 non-null   float64
 7   runtimeMinutes  590656 non-null  float64
 8   genres          761137 non-null  object 
 9   averageRating   761138 non-null  float64
 10  numVotes        761138 non-null  int64  
dtypes: bool(1), float64(4), int64(1), object(5)
memory usage: 58.8+ MB


Unnamed: 0,id,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,False,1894.0,,1.0,"Documentary,Short",5.6,1643
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,False,1892.0,,5.0,"Animation,Short",6.1,198
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,False,1892.0,,4.0,"Animation,Comedy,Romance",6.5,1336
3,tt0000004,short,Un bon bock,Un bon bock,False,1892.0,,12.0,"Animation,Short",6.2,120
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,False,1893.0,,1.0,"Comedy,Short",6.1,2119
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,False,1894.0,,1.0,Short,5.3,115
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,False,1894.0,,1.0,"Short,Sport",5.5,650
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,False,1894.0,,1.0,"Documentary,Short",5.4,1797
8,tt0000009,movie,Miss Jerry,Miss Jerry,False,1894.0,,45.0,Romance,5.9,154
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,False,1895.0,,1.0,"Documentary,Short",6.9,6005
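
Note that the years now display as floats (1894.0) because NaN is a floating-point value. The following is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption, not part of the IMDb file) showing what `errors='coerce'` does, and how pandas' nullable `Int64` dtype could keep whole-number years if that display bothers you:

```python
import pandas as pd

# Toy stand-in for the real data: '\N' marks a missing value, as in the IMDb export.
toy = pd.DataFrame({
    "startYear": ["1894", "1892", "\\N"],
})

# errors='coerce' turns anything that cannot be parsed into NaN,
# which forces the column to float64.
years = pd.to_numeric(toy["startYear"], errors="coerce")
print(years.dtype)  # float64

# Optional: the nullable integer dtype stores whole numbers
# while still allowing missing values (shown as <NA>).
years_int = years.astype("Int64")
print(years_int.tolist())
```

The rest of this notebook keeps the float64 columns, so this step is optional.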


Questions to answer:

1. How many items are in the dataset? *(easy)*
2. What's the oldest movie in the dataset? *(easy)*
3. Which movie has the most rating votes? *(easy)*
4. In which year were the most movies released? *(medium)*
5. Which movie has the lowest rating? Use the number of votes cast as a tiebreaker. *(medium)*
6. Which genre has the most movies? *(extra hard)*

In [39]:
# Add your solution to question 1 here.
# Hint: Try df.shape

df.shape[0]

761138

In [40]:
# Add your solution to question 2 here.
# Hint 1: try out the sort_values method on the dataframe.
# Hint 2: try the idxmin method on the startYear column

df.loc[df['startYear'].idxmin()]

id                        tt3155794
titleType                     short
primaryTitle       Passage de Venus
originalTitle      Passage de Venus
isAdult                       False
startYear                      1874
endYear                         NaN
runtimeMinutes                    1
genres            Documentary,Short
averageRating                   6.9
numVotes                       1198
Name: 558470, dtype: object
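
The result above is a `short`, because the dataset mixes several title types. As a hedged sketch on made-up data (the `toy` frame and its titles are hypothetical), here is the `sort_values` route from Hint 1, plus one way to restrict the search to rows whose `titleType` is `movie`:

```python
import pandas as pd

# Hypothetical rows, shaped like the columns used above.
toy = pd.DataFrame({
    "primaryTitle": ["Old Short", "First Feature", "Later Feature", "Undated"],
    "titleType": ["short", "movie", "movie", "movie"],
    "startYear": [1874.0, 1894.0, 1920.0, float("nan")],
})

# Hint 1: sort by year and take the first row. NaN years sort last by default.
oldest_any = toy.sort_values("startYear").iloc[0]

# Keep only rows typed as 'movie' before looking for the minimum year.
movies = toy[toy["titleType"] == "movie"]
oldest_movie = movies.loc[movies["startYear"].idxmin()]

print(oldest_any["primaryTitle"])    # Old Short
print(oldest_movie["primaryTitle"])  # First Feature
```

On the real data the same filter would be `df[df['titleType'] == 'movie']`. Whether "movie" in the question means that title type or any title is a judgment call.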

In [41]:
# Add your solution to question 3 here.
# Hint - similar to question 2
df.loc[df['numVotes'].idxmax()]

id                               tt0111161
titleType                            movie
primaryTitle      The Shawshank Redemption
originalTitle     The Shawshank Redemption
isAdult                              False
startYear                             1994
endYear                                NaN
runtimeMinutes                         142
genres                               Drama
averageRating                          9.3
numVotes                           2271679
Name: 78079, dtype: object

In [42]:
# Add your solution to question 4 here.
# Hint: try out the value_counts method on the startYear column

df['startYear'].value_counts()

2016.0    31944
2017.0    31748
2015.0    30180
2018.0    29874
2014.0    29095
          ...  
1881.0        1
2021.0        1
1883.0        1
1885.0        1
1874.0        1
Name: startYear, Length: 140, dtype: int64
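
`value_counts()` sorts by count, so the answer is the first row of the output. If you want the year itself rather than the whole table, `idxmax()` returns the index label with the highest count. A small sketch on a made-up Series (the values are an assumption for illustration):

```python
import pandas as pd

# Hypothetical release years, including one missing value.
start_year = pd.Series([2016.0, 2017.0, 2016.0, 2015.0, 2016.0, 2017.0, float("nan")])

counts = start_year.value_counts()  # NaN is dropped by default
busiest_year = counts.idxmax()      # index label with the highest count

print(int(busiest_year), counts.max())  # 2016 3
```

Applied to the real column this would be `df['startYear'].value_counts().idxmax()`.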

In [43]:
# Add your solution to question 5 here.
# Hint: try the sort_values method with some extra arguments.

df.sort_values(by=['averageRating','numVotes'], ascending=[True, False])


Unnamed: 0,id,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
705191,tt7221896,movie,Cumali Ceber,Cumali Ceber: Allah Seni Alsin,False,2017.0,,100.0,Comedy,1.0,37571
478664,tt1852818,video,Justin Bieber: Rise to Fame,Justin Bieber: Rise to Fame,False,2011.0,,65.0,Documentary,1.0,2728
482063,tt1885205,video,Biebermania!,Biebermania!,False,2011.0,,70.0,"Documentary,Music",1.0,2072
485808,tt1934806,tvSeries,Sasural Simar Ka,Sasural Simar Ka,False,2011.0,2018.0,22.0,"Drama,Fantasy",1.0,1392
383233,tt11766306,tvSeries,Attaway General,Attaway General,False,2020.0,,,Drama,1.0,842
...,...,...,...,...,...,...,...,...,...,...,...
752277,tt9318118,short,Last Remains,Last Remains,False,2020.0,,,"Drama,Short",10.0,5
752579,tt9337212,video,The Nation Of Bangladesh,The Nation Of Bangladesh,False,2018.0,,22.0,Documentary,10.0,5
753542,tt9394856,movie,Humans of Our World: The Journey,Humans of Our World: The Journey,False,2019.0,,,Documentary,10.0,5
755949,tt9572522,short,Rhino,Rhino,False,2018.0,,19.0,"Drama,Short",10.0,5
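
The sort puts the lowest rating first and, among ties, the title with the most votes first, so the answer is the top row. A minimal sketch on made-up data (the `toy` frame is hypothetical) that pulls out just that row with `.iloc[0]`:

```python
import pandas as pd

# Hypothetical ratings: A and B tie on the lowest rating.
toy = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "averageRating": [1.0, 1.0, 2.5, 10.0],
    "numVotes": [200, 37000, 50, 5],
})

# Rating ascending, then votes descending to break ties.
ranked = toy.sort_values(by=["averageRating", "numVotes"], ascending=[True, False])
lowest = ranked.iloc[0]

print(lowest["primaryTitle"])  # B
```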


In [71]:
# Add your solution to question 6 here.
# This one is really for extra credit - I haven't covered everything you'd need to know to get this yet.

# We'll need to first unpack the string data for the genres column.
# First split the genres column into lists, and then use the explode() method to turn those lists into separate rows.
genres = df['genres'].str.split(',').explode()

# We can then call value_counts on genres to find the biggest one.
genres.value_counts()



Drama          273300
Comedy         217337
Short          131468
Documentary     91224
Action          79828
Crime           75162
Romance         61476
Adventure       59652
Animation       53524
Family          45748
Music           39066
Horror          36863
Mystery         35084
Thriller        34948
Fantasy         30057
Sci-Fi          21988
Reality-TV      19421
Talk-Show       18986
\N              17948
History         16963
Biography       14938
Adult           14681
Sport           12147
Game-Show       10917
War              9235
Musical          9107
Western          8950
News             8352
Film-Noir         767
Name: genres, dtype: int64
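
Note that `\N` shows up in the counts: it is the placeholder for a missing genre, not a genre. As a hedged sketch on a made-up Series (the values are an assumption), this shows what `split` and `explode` do step by step and one way to drop the placeholder before counting:

```python
import pandas as pd

# Hypothetical genres column: comma-separated strings, a '\N' placeholder, and a null.
genres_col = pd.Series(["Documentary,Short", "Animation,Short", "\\N", "Drama", None])

# split turns each string into a list; explode gives every list item its own row.
genres = genres_col.str.split(",").explode()

# Drop real nulls and the '\N' placeholder before counting.
genres = genres.dropna()
genres = genres[genres != "\\N"]

counts = genres.value_counts()
print(counts.idxmax(), counts.max())  # Short 2
```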