# Pandas Fundamentals: Working with Movie Data

## What is Pandas?

**Pandas** is a popular open-source Python library for data manipulation and analysis. It provides fast, easy-to-use data structures and tools for working with labeled and tabular data.

The two primary data structures in Pandas are:
1.  **Series:** A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). Think of it as a single column in a spreadsheet.
2.  **DataFrame:** A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as a spreadsheet, an SQL table, or a dictionary of Series objects.

We'll be primarily working with DataFrames.
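Since a DataFrame can be thought of as a dictionary of Series, a small sketch may make the relationship concrete. The names `years`, `ratings`, and `films` below are illustrative and not part of the tutorial's dataset.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
years = pd.Series([1994, 1972], index=['Movie A', 'Movie B'])
ratings = pd.Series([9.3, 9.2], index=['Movie A', 'Movie B'])

# A DataFrame can be built from a dictionary of Series; each Series becomes a column
films = pd.DataFrame({'Year': years, 'Rating': ratings})

# Pulling a column back out returns a Series again
year_column = films['Year']
print(type(year_column).__name__)  # Series
print(films.shape)                 # (2, 2)
```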

First, let's import Pandas. The common convention is to import it with the alias `pd`.

In [1]:
import pandas as pd

## Creating a Series

While we'll focus on DataFrames, understanding Series is helpful as DataFrames are composed of them.

### From a List (Default Index)

In [2]:
movie_ratings_list = [8.5, 7.9, 9.1, 6.5]
ratings_series = pd.Series(movie_ratings_list)
print(ratings_series)

0    8.5
1    7.9
2    9.1
3    6.5
dtype: float64


### From a List with a Custom Index

In [3]:
movie_titles = ['Movie A', 'Movie B', 'Movie C', 'Movie D']
ratings_series_custom_index = pd.Series(movie_ratings_list, index=movie_titles)
print(ratings_series_custom_index)
print("\nRating for Movie B:", ratings_series_custom_index['Movie B'])

Movie A    8.5
Movie B    7.9
Movie C    9.1
Movie D    6.5
dtype: float64

Rating for Movie B: 7.9
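A Series can also be built directly from a dictionary, in which case the keys become the index labels. The following is a minimal sketch with made-up ratings; `ratings_by_title` and `high_rated` are hypothetical names.

```python
import pandas as pd

# Dictionary keys become index labels, values become the data
ratings_by_title = pd.Series({'Movie A': 8.5, 'Movie B': 7.9, 'Movie C': 9.1})

print(ratings_by_title['Movie C'])   # lookup by label: 9.1
print(ratings_by_title.iloc[0])      # lookup by position: 8.5

# Comparisons apply element-wise and keep the index labels
high_rated = ratings_by_title[ratings_by_title > 8.0]
print(high_rated.index.tolist())     # ['Movie A', 'Movie C']
```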


## Creating a DataFrame

### From a Dictionary of Lists
One way to create a DataFrame is from a dictionary where keys are column names and values are lists of column data.

In [4]:
data = {
    'Title': ['The Shawshank Redemption', 'The Godfather', 'The Dark Knight'],
    'Year': [1994, 1972, 2008],
    'Rating': [9.3, 9.2, 9.0]
}
simple_df = pd.DataFrame(data)
print(simple_df)

                      Title  Year  Rating
0  The Shawshank Redemption  1994     9.3
1             The Godfather  1972     9.2
2           The Dark Knight  2008     9.0


### Loading Data from a CSV File (Online)

A very common task is to load data from a CSV (comma-separated values) file. Pandas handles this with a single call to `pd.read_csv()`, which accepts a local file path or a URL.
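As a rough illustration, `pd.read_csv()` also accepts a file-like object, so a short in-memory string can stand in for a file when experimenting offline. The sample text below is an assumption that mirrors a `movieId,title,genres` layout; `csv_text`, `sample_df`, and `titles_only` are hypothetical names.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a CSV file (two illustrative rows)
csv_text = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
"""

# read_csv parses the header row into column names
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols restricts which columns are parsed
titles_only = pd.read_csv(io.StringIO(csv_text), usecols=['movieId', 'title'])

print(sample_df.shape)             # (2, 3)
print(list(titles_only.columns))   # ['movieId', 'title']
```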

We'll use the **MovieLens Latest Small Dataset**. Let's load the `movies.csv` file.

In [5]:
# URL for the movies.csv file from MovieLens Latest Small Dataset
movies_url = 'https://drive.google.com/uc?export=download&id=1Uztnn449pnDBDn1XGJPF6uzV34jrP1Te'

# Read the CSV file into a Pandas DataFrame
movies_df = pd.read_csv(movies_url)

print(type(movies_df))

<class 'pandas.core.frame.DataFrame'>


## Inspecting the DataFrame

Once data is loaded, the first step is always to inspect it.

### `head()` - View the first few rows
This is useful to get a quick glimpse of your data. By default, it shows the first 5 rows.

In [6]:
movies_df.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


You can specify the number of rows:

In [7]:
movies_df.head(3)

   movieId                    title                                       genres
0        1         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2           Jumanji (1995)                   Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)                               Comedy|Romance


### `tail()` - View the last few rows

In [8]:
movies_df.tail(3)

      movieId                                title            genres
9739   193585                         Flint (2017)             Drama
9740   193587  Bungo Stray Dogs: Dead Apple (2018)  Action|Animation
9741   193609  Andrew Dice Clay: Dice Rules (1991)            Comedy


### `shape` - Get the dimensions (rows, columns)

In [9]:
print("Shape of the DataFrame (rows, columns):", movies_df.shape)

Shape of the DataFrame (rows, columns): (9742, 3)


### `size` - Get the total number of elements

In [10]:
print("Total number of elements:", movies_df.size)

Total number of elements: 29226
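For a DataFrame, `size` equals the row count multiplied by the column count from `shape`, and `len()` returns the number of rows. A small sketch with a made-up frame (`small_df` is a hypothetical name):

```python
import pandas as pd

small_df = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['A', 'B', 'C', 'D'],
})

n_rows, n_cols = small_df.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)             # 4 2
print(small_df.size)              # 8, i.e. 4 * 2
print(len(small_df))              # 4 (rows only)
```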


### `columns` - Get the column names

In [11]:
print("Column names:", movies_df.columns)
print("As a list:", list(movies_df.columns))

Column names: Index(['movieId', 'title', 'genres'], dtype='object')
As a list: ['movieId', 'title', 'genres']


### `dtypes` - Get the data type of each column
Pandas infers a data type for each column when it loads the data. The `object` dtype usually means the column holds strings.

In [12]:
print("Data types of columns:")
print(movies_df.dtypes)

Data types of columns:
movieId     int64
title      object
genres     object
dtype: object
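Once the dtypes are known, they can be used to pick out columns or to convert a column. The sketch below uses a made-up frame (`mixed_df` and its `rating` values are assumptions for illustration) together with `select_dtypes()` and `astype()`.

```python
import pandas as pd

mixed_df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'rating': [3.9, 3.4, 3.3],   # made-up values for illustration
})

# Keep only the numeric columns (integers and floats)
numeric_cols = mixed_df.select_dtypes(include='number')
print(list(numeric_cols.columns))   # ['movieId', 'rating']

# Convert an integer column to text, e.g. to treat IDs as labels
ids_as_text = mixed_df['movieId'].astype(str)
print(ids_as_text.iloc[0] + '-x')   # 1-x
```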


### `info()` - Get a concise summary of the DataFrame
This includes data types, number of non-null values, and memory usage.

In [13]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


## Selecting Data

### Selecting Columns
You can select a single column using `df['column_name']`, which returns a Series.

In [14]:
titles_series = movies_df['title']
print(type(titles_series))
titles_series.head()

<class 'pandas.core.series.Series'>


0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
Name: title, dtype: object

To select multiple columns, pass a list of column names `df[['col1', 'col2']]`, which returns a DataFrame.

In [15]:
title_and_genres_df = movies_df[['title', 'genres']]
print(type(title_and_genres_df))
title_and_genres_df.head()

<class 'pandas.core.frame.DataFrame'>


                                title                                       genres
0                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1                      Jumanji (1995)                   Adventure|Children|Fantasy
2             Grumpier Old Men (1995)                               Comedy|Romance
3            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4  Father of the Bride Part II (1995)                                       Comedy
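The difference between single and double brackets also applies when only one column is requested: a list with one name still returns a DataFrame. A minimal sketch on a made-up two-row frame (`demo_df` is a hypothetical name):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

as_series = demo_df['title']     # single brackets: one-dimensional Series
as_frame = demo_df[['title']]    # list of names: two-dimensional DataFrame

print(as_series.shape)   # (2,)
print(as_frame.shape)    # (2, 1)
```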


### Selecting Rows and Columns with `loc` and `iloc`

Pandas provides two main indexers for selecting rows and columns (both are used with square brackets rather than called like methods):
*   `loc`: Label-based selection. You use the actual row and column labels.
*   `iloc`: Integer position-based selection. You use integer indices (like list slicing).

#### `iloc` (Integer-based selection)
Selects data based on its integer position (0-indexed).

In [16]:
# Select the first row (index 0)
print("First row:\n", movies_df.iloc[0])
print("\nType of selection:", type(movies_df.iloc[0]))

# Select rows 0, 1, 2 (slicing is exclusive of the end index)
print("\nFirst 3 rows (DataFrame):\n", movies_df.iloc[0:3])

# Select the element at row 0, column 1 (title of the first movie)
print("\nElement at [0,1]:", movies_df.iloc[0, 1])

# Select rows 0 to 2, and columns 1 to 2 (title and genres)
print("\nRows 0-2, Columns 1-2:\n", movies_df.iloc[0:3, 1:3])

First row:
 movieId                                              1
title                                 Toy Story (1995)
genres     Adventure|Animation|Children|Comedy|Fantasy
Name: 0, dtype: object

Type of selection: <class 'pandas.core.series.Series'>

First 3 rows (DataFrame):
    movieId                    title  \
0        1         Toy Story (1995)   
1        2           Jumanji (1995)   
2        3  Grumpier Old Men (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  

Element at [0,1]: Toy Story (1995)

Rows 0-2, Columns 1-2:
                      title                                       genres
0         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1           Jumanji (1995)                   Adventure|Children|Fantasy
2  Grumpier Old Men (1995)                               Comedy|Romance


#### `loc` (Label-based selection)
Selects data based on labels of rows and columns. If the DataFrame index is the default (0, 1, 2...), then `loc` can behave similarly to `iloc` for row selection, but it's conceptually different.
Slicing with `loc` is *inclusive* of the end label.

In [17]:
# Let's use the default integer index for rows for now
# Select row with index label 0
print("Row with index label 0:\n", movies_df.loc[0])

# Select rows with index labels 0, 1, 2 (inclusive)
print("\nRows with index labels 0 to 2:\n", movies_df.loc[0:2]) 

# Select row with index label 0, and column 'title'
print("\nRow 0, column 'title':", movies_df.loc[0, 'title'])

# Select rows 0-2, and columns 'title' and 'genres'
print("\nRows 0-2, columns 'title' & 'genres':\n", movies_df.loc[0:2, ['title', 'genres']])

Row with index label 0:
 movieId                                              1
title                                 Toy Story (1995)
genres     Adventure|Animation|Children|Comedy|Fantasy
Name: 0, dtype: object

Rows with index labels 0 to 2:
    movieId                    title  \
0        1         Toy Story (1995)   
1        2           Jumanji (1995)   
2        3  Grumpier Old Men (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  

Row 0, column 'title': Toy Story (1995)

Rows 0-2, columns 'title' & 'genres':
                      title                                       genres
0         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1           Jumanji (1995)                   Adventure|Children|Fantasy
2  Grumpier Old Men (1995)                               Comedy|Romance


**When to use `loc` vs `iloc`?**
*   Use `iloc` when you know the integer positions.
*   Use `loc` when you know the labels. It's generally safer if your index isn't a simple range or if column order might change, as it's explicit about what you're selecting.
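The distinction becomes visible once the row labels differ from the row positions. The sketch below assumes a made-up frame whose index is `[10, 20, 30]` (`demo_df` is a hypothetical name), so labels and positions no longer coincide.

```python
import pandas as pd

demo_df = pd.DataFrame(
    {'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)']},
    index=[10, 20, 30],   # hypothetical non-default row labels
)

print(demo_df.iloc[0, 0])        # first row by position: Toy Story (1995)
print(demo_df.loc[10, 'title'])  # row labelled 10: Toy Story (1995)

# iloc slices exclude the end position; loc slices include the end label
by_position = demo_df.iloc[0:2]  # positions 0 and 1
by_label = demo_df.loc[10:30]    # labels 10, 20, and 30
print(len(by_position))          # 2
print(len(by_label))             # 3

# No row carries the label 0, so loc raises KeyError
try:
    demo_df.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False
print(label_zero_exists)         # False
```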

## Filtering Data (Boolean Indexing)

You can filter rows based on conditions. This is a very powerful feature.

First, let's see what a condition looks like. It returns a Series of Booleans:

In [18]:
# Condition: movieId greater than 10
condition_movieId = movies_df['movieId'] > 10
print(condition_movieId.head())

0    False
1    False
2    False
3    False
4    False
Name: movieId, dtype: bool


Now, use this Boolean Series to filter the DataFrame:

In [19]:
filtered_movies = movies_df[condition_movieId]
# or more concisely: movies_df[movies_df['movieId'] > 10]
filtered_movies.head()

    movieId                               title                        genres
10       11      American President, The (1995)          Comedy|Drama|Romance
11       12  Dracula: Dead and Loving It (1995)                 Comedy|Horror
12       13                        Balto (1995)  Adventure|Animation|Children
13       14                        Nixon (1995)                         Drama
14       15             Cutthroat Island (1995)      Action|Adventure|Romance
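A Boolean Series can do more than filter whole rows. Because `True` counts as 1, summing the mask gives the number of matching rows, and passing the mask to `loc` allows selecting specific columns at the same time. A sketch on a made-up three-row frame (`demo_df`, `mask`, and `matching_titles` are hypothetical names):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'movieId': [5, 11, 12],
    'title': [
        'Father of the Bride Part II (1995)',
        'American President, The (1995)',
        'Dracula: Dead and Loving It (1995)',
    ],
})

mask = demo_df['movieId'] > 10

# True is counted as 1, so the sum is the number of matching rows
print(mask.sum())   # 2

# Filter rows with the mask and pick a single column in one step
matching_titles = demo_df.loc[mask, 'title']
print(matching_titles.tolist())
```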


### Filtering by String Content
The `.str` accessor allows us to use string methods on a Series.
Let's find movies containing 'Adventure' in their genres.

In [20]:
adventure_movies = movies_df[movies_df['genres'].str.contains('Adventure')]
adventure_movies.head()

    movieId                title                                       genres
0         1     Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1         2       Jumanji (1995)                   Adventure|Children|Fantasy
7         8  Tom and Huck (1995)                           Adventure|Children
9        10     GoldenEye (1995)                    Action|Adventure|Thriller
12       13         Balto (1995)                 Adventure|Animation|Children
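By default, `str.contains()` treats its pattern as a regular expression and is case-sensitive. The sketch below, on a made-up Series that includes a missing value, shows the `case`, `na`, and `regex` parameters; the data and variable names are assumptions for illustration.

```python
import pandas as pd

genres = pd.Series(['Adventure|Children', 'Comedy|Romance', None, 'Action|adventure'])

# case=False ignores letter case; na=False treats missing values as non-matches
mask = genres.str.contains('adventure', case=False, na=False)
print(mask.tolist())      # [True, False, False, True]

# regex=False matches the text literally, so '|' is not read as regex "or"
literal = genres.str.contains('Adventure|Children', regex=False, na=False)
print(literal.tolist())   # [True, False, False, False]
```

Without `na=False`, the missing entry would stay missing in the result, which cannot be used directly as a Boolean filter.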


### Combining Conditions
Use `&` for AND and `|` for OR. Wrap each individual condition in parentheses `()`, because `&` and `|` bind more tightly than comparison operators such as `>` or `==`.

Let's find 'Adventure' movies that also contain 'Children'.

In [21]:
adventure_children_movies = movies_df[
    (movies_df['genres'].str.contains('Adventure')) & 
    (movies_df['genres'].str.contains('Children'))
]
adventure_children_movies.head()

    movieId                               title                                       genres
0         1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1         2                      Jumanji (1995)                   Adventure|Children|Fantasy
7         8                 Tom and Huck (1995)                           Adventure|Children
12       13                        Balto (1995)                 Adventure|Animation|Children
53       60  Indian in the Cupboard, The (1995)                   Adventure|Children|Fantasy
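The same pattern works for OR, and a condition can be negated with `~` (an operator not used elsewhere in this tutorial). A minimal sketch on a made-up four-row frame; `demo_df`, `is_comedy`, and `is_drama` are hypothetical names.

```python
import pandas as pd

demo_df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)', 'Nixon (1995)', 'GoldenEye (1995)'],
    'genres': [
        'Adventure|Animation|Children|Comedy|Fantasy',
        'Comedy|Romance',
        'Drama',
        'Action|Adventure|Thriller',
    ],
})

is_comedy = demo_df['genres'].str.contains('Comedy')
is_drama = demo_df['genres'].str.contains('Drama')

# OR: rows where at least one condition holds
comedy_or_drama = demo_df[is_comedy | is_drama]
print(len(comedy_or_drama))            # 3

# NOT: rows where the condition does not hold
not_comedy = demo_df[~is_comedy]
print(not_comedy['title'].tolist())    # ['Nixon (1995)', 'GoldenEye (1995)']
```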


## Sorting Data

The `sort_values()` method sorts a DataFrame by one or more columns.

In [22]:
# Sort movies by title alphabetically (ascending by default)
sorted_by_title = movies_df.sort_values(by='title')
sorted_by_title.head()

      movieId                                    title                                       genres
8600   117867                               '71 (2014)                    Action|Drama|Thriller|War
8014    97757  'Hellboy': The Seeds of Creation (2004)  Action|Adventure|Comedy|Documentary|Fantasy
5528    26564                   'Round Midnight (1986)                                Drama|Musical
5690    27751                      'Salem's Lot (2004)                Drama|Horror|Mystery|Thriller
614       779                'Til There Was You (1997)                                Drama|Romance


In [23]:
# Sort movies by title in descending order
sorted_by_title_desc = movies_df.sort_values(by='title', ascending=False)
sorted_by_title_desc.head()

      movieId                                      title                  genres
3947     5560  À nous la liberté (Freedom for Us) (1931)          Comedy|Musical
1866     2478                      ¡Three Amigos! (1986)          Comedy|Western
5882    33158             xXx: State of the Union (2005)   Action|Crime|Thriller
3920     5507                                 xXx (2002)   Action|Crime|Thriller
1961     2600                            eXistenZ (1999)  Action|Sci-Fi|Thriller
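The descending result starts with accented and lowercase titles because strings are compared character by character by code point, so uppercase ASCII letters sort before lowercase ones. One possible way to get a case-insensitive order is the `key` argument of `sort_values()`, sketched here on a made-up frame (this assumes pandas 1.1 or newer; `demo_df` is a hypothetical name).

```python
import pandas as pd

demo_df = pd.DataFrame({
    'title': ['xXx (2002)', 'Zodiac (2007)', 'eXistenZ (1999)', 'Alien (1979)'],
})

# Default order: uppercase letters come before lowercase letters
default_order = demo_df.sort_values(by='title')['title'].tolist()
print(default_order)
# ['Alien (1979)', 'Zodiac (2007)', 'eXistenZ (1999)', 'xXx (2002)']

# key receives the column as a Series; lowercasing it gives a case-insensitive sort
folded_order = demo_df.sort_values(by='title', key=lambda s: s.str.lower())['title'].tolist()
print(folded_order)
# ['Alien (1979)', 'eXistenZ (1999)', 'xXx (2002)', 'Zodiac (2007)']
```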


**Important Note:** Many Pandas methods, including `sort_values()`, return a *new* DataFrame by default and leave the original DataFrame `movies_df` unchanged.
If you want to keep the sorted result under the original name, you can either reassign it:
`movies_df = movies_df.sort_values(by='title')`
or pass the `inplace=True` argument (use with caution: the call then modifies `movies_df` directly and returns `None`):
`movies_df.sort_values(by='title', inplace=True)`

In [24]:
print("Original DataFrame head after sorting (no inplace=True):")
movies_df.head() # Will show original order

Original DataFrame head after sorting (no inplace=True):


   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


## Basic Descriptive Statistics

The `describe()` method provides descriptive statistics for numerical columns.

In [25]:
# In movies_df, 'movieId' is the only numeric column, so it is the only one summarized
movies_df.describe()

            movieId
count        9742.0
mean   42200.353623
std    52160.494854
min             1.0
25%         3248.25
50%          7300.0
75%         76232.0
max        193609.0


For non-numerical columns, you can use `describe(include='object')` or `describe(include='all')`.

In [26]:
movies_df.describe(include='object')

              title  genres
count          9742    9742
unique         9737     951
top     Emma (1996)   Drama
freq              2    1053


### Unique Values and Value Counts
Often, you want to know the unique values in a column or how many times each value appears.

`nunique()`: Counts the number of unique values.

In [27]:
print("Number of unique genre combinations:", movies_df['genres'].nunique())

Number of unique genre combinations: 951


`value_counts()`: Returns a Series containing counts of unique values.

In [32]:
print("Top 10 most common genre combinations:")
movies_df['genres'].value_counts().head(10)

Top 10 most common genre combinations:


genres
Drama                   1053
Comedy                   946
Comedy|Drama             435
Comedy|Romance           363
Drama|Romance            349
Documentary              339
Comedy|Drama|Romance     276
Drama|Thriller           168
Horror                   167
Horror|Thriller          135
Name: count, dtype: int64
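Counts are often easier to compare as proportions. As a sketch, `value_counts(normalize=True)` returns each value's share of the total instead of its raw count; the Series below is made up for illustration, and `counts` and `shares` are hypothetical names.

```python
import pandas as pd

genres = pd.Series(['Drama', 'Comedy', 'Drama', 'Comedy|Drama', 'Drama'])

# Raw counts, sorted from most to least frequent
counts = genres.value_counts()
print(counts['Drama'])    # 3

# Relative frequencies that sum to 1
shares = genres.value_counts(normalize=True)
print(shares['Drama'])    # 0.6
```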

## Next Steps

In the next session, we'll explore:
*   Handling missing data (briefly).
*   Adding and modifying columns.
*   Grouping data with `groupby()`.
*   Merging DataFrames (e.g., combining movies with their ratings).
*   More advanced string operations and applying functions.
*   And how these concepts can lead to building simple recommendation logic.