# Pandas Fundamentals: MovieLens Data Exploration Lab

## Introduction

Welcome to the Pandas Fundamentals Lab. In this lab, you will apply the concepts you've learned to explore and analyze the MovieLens dataset. You'll be working with three files:

1.  `movies.csv`: Contains movie information (movieId, title, genres).
2.  `ratings.csv`: Contains ratings given by users to movies (userId, movieId, rating, timestamp).
3.  `tags.csv`: Contains tags applied by users to movies (userId, movieId, tag, timestamp).

**Your Goal:** Answer the questions and complete the tasks in each section. Use the Pandas skills you've learned, including loading data, inspecting DataFrames, selecting data, filtering, sorting, and basic descriptive statistics.

**Remember to:**
*   Read the questions carefully.
*   Use the hints if you get stuck.
*   Experiment. Pandas is best learned by doing.

## Setup

First, let's import the Pandas library.

In [1]:
# your code here

## Part 1: Exploring `movies.csv`

### 1.1: Load the `movies.csv` dataset
Load the `movies.csv` file into a Pandas DataFrame called `movies_df`.

**Hint:** The URL for the file is `https://files.grouplens.org/datasets/movielens/ml-latest-small/movies.csv`
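
As a rough illustration of the loading step, the sketch below reads a tiny made-up CSV from memory. The `sample_csv` text and the `sample_df` name are hypothetical stand-ins for the real file; `pd.read_csv` takes a path or URL in the same way.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like movies.csv (not the real data)
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Alpha (1995),Adventure|Comedy\n"
    "2,Beta (1995),Adventure|Fantasy\n"
    "3,Gamma (1996),Action|Thriller\n"
)

# pd.read_csv accepts a file path, a URL, or a file-like object
sample_df = pd.read_csv(sample_csv)

# head() shows the first rows so you can confirm the load worked
print(sample_df.head())
```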

In [None]:
movies_url = 'https://drive.google.com/uc?export=download&id=1Uztnn449pnDBDn1XGJPF6uzV34jrP1Te'
# Your code here to load the data into movies_df
# your code here

# Display the first few rows to confirm loading
# your code here

### 1.2: Inspect the `movies_df` DataFrame

a. How many movies are in this dataset? Report the number of rows and columns.
   **Hint:** Use the `.shape` attribute.

b. Display the column names of `movies_df`.
   **Hint:** Use the `.columns` attribute.

c. What are the data types of each column in `movies_df`?
   **Hint:** Use the `.dtypes` attribute or `.info()` method.

d. Display the first 10 movies in the DataFrame.
   **Hint:** Use the `.head()` method.

e. Display the last 7 movies in the DataFrame.
   **Hint:** Use the `.tail()` method.
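
The inspection tools named in the hints can be tried first on a small made-up frame. This is only a sketch; `toy_df` and its columns are hypothetical and unrelated to MovieLens.

```python
import pandas as pd

# Hypothetical four-row frame used only to demonstrate the inspection tools
toy_df = pd.DataFrame({
    "item_id": [10, 20, 30, 40],
    "name": ["alpha", "beta", "gamma", "delta"],
    "score": [1.5, 2.0, 3.5, 4.0],
})

print(toy_df.shape)    # a (rows, columns) tuple; an attribute, so no parentheses
print(toy_df.columns)  # the column labels
print(toy_df.dtypes)   # one data type per column
print(toy_df.head(2))  # first 2 rows (the default is 5)
print(toy_df.tail(1))  # last row
```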

In [None]:
# a. Number of movies (rows and columns)
print("Shape of movies_df:")
# your code here

# b. Column names
print("\nColumn names:")
# your code here

# c. Data types
print("\nData types:")
# your code here

# d. First 10 movies
print("\nFirst 10 movies:")
# your code here

# e. Last 7 movies
print("\nLast 7 movies:")
# your code here

### 1.3: Selecting Data from `movies_df`

a. Select and display only the `title` column for all movies. What is the type of this selection?
   **Hint:** `df['column_name']`.

b. Select and display the `title` and `genres` columns for all movies. What is the type of this selection?
   **Hint:** `df[['col1', 'col2']]`.

c. Using `iloc`, select and display the movie at the 100th row (integer position 99).

d. Using `iloc`, select and display the `title` (which is at column index 1) of the movie at the 100th row (integer position 99).

e. Select and display all information for the movie with `movieId` 1. To use `loc` with the ID directly, you would first need to set `movieId` as the index.
   *Alternative if you don't want to change the index:* Filter the DataFrame for `movieId == 1` and display it.

f. Using `loc` (and assuming `movieId` is NOT the index yet), select the `genres` of the movie with index label 5. With the default integer index, label 5 is also the sixth row.
   **Hint:** `df.loc[index_label, column_label]`
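
The difference between the selection styles above is easiest to see when index labels and positions do not coincide. The sketch below uses a hypothetical `toy_df` with labels 100, 200 and 300 chosen for that purpose.

```python
import pandas as pd

# Hypothetical frame; the index labels deliberately differ from row positions
toy_df = pd.DataFrame(
    {"name": ["alpha", "beta", "gamma"], "group": ["x", "y", "x"]},
    index=[100, 200, 300],
)

one_col = toy_df["name"]               # single brackets give a Series
two_cols = toy_df[["name", "group"]]   # a list of names gives a DataFrame

by_position = toy_df.iloc[1, 0]        # row position 1, column position 0
by_label = toy_df.loc[300, "group"]    # row label 300, column label "group"

print(type(one_col), type(two_cols))
print(by_position, by_label)
```

With the default index (0, 1, 2, ...), labels and positions happen to match, which is why `loc` and `iloc` can look interchangeable on a freshly loaded file.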

In [None]:
# a. Select 'title' column
# Your code here
print("Titles Series (first 5):")
# your code here
print("Type of selection:")
# your code here

# b. Select 'title' and 'genres' columns
# your code here
print("\nTitle and Genres DataFrame (first 5):")
# your code here
print("Type of selection:")
# your code here

# c. Movie at 100th row using iloc
# your code here

# d. Title of the movie at 100th row using iloc
# your code here

# e. Movie with movieId 1 (using filtering, as index is not movieId yet)
# your code here

# f. Genres of the movie with index label 5 using loc
# your code here

### 1.4: Filtering `movies_df`

a. Find and display all movies that have 'Animation' in their `genres`.
   **Hint:** Use `.str.contains()`.

b. Find and display all movies whose `genres` include both 'Comedy' AND 'Romance'.
   **Hint:** Use `&` and parentheses for conditions: `(condition1) & (condition2)`.

c. Find and display all movies whose `genres` include 'Horror' OR 'Sci-Fi'.
   **Hint:** Use `|`.

d. Find and display all 'Drama' movies that do NOT contain 'Romance' in their genres.
   **Hint:** Use `~` for NOT.

e. **Challenge:** Extract the year from the `title` string (e.g., "Toy Story (1995)" -> 1995). Create a new column called `year`. Then, find all movies released in 1995.
   **Hints:** 
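     *   `movies_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)` extracts the 4 digits in parentheses at the end of the title (allowing trailing whitespace). `str.extract` requires a capture group, here the inner unescaped parentheses, to know what to return.
     *   The extracted values are strings; convert them to numbers using `pd.to_numeric()`.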
     *   Handle potential `NaN` values if some titles don't have a year in that format.
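
The filtering operators and the year-extraction idea can be rehearsed on a few made-up rows. In this sketch `toy_df` and its titles are hypothetical. The regex uses one capture group, which `str.extract` needs, and `errors="coerce"` is one possible way to turn missing years into `NaN`.

```python
import pandas as pd

# Hypothetical rows; "Gamma" has no year so the NaN path is exercised
toy_df = pd.DataFrame({
    "title": ["Alpha (1995)", "Beta (2001)", "Gamma", "Delta (1995)"],
    "genres": ["Comedy|Romance", "Drama", "Drama|Romance", "Horror|Sci-Fi"],
})

# Each condition is a boolean Series (a mask) with one value per row
is_drama = toy_df["genres"].str.contains("Drama")
is_romance = toy_df["genres"].str.contains("Romance")

drama_and_romance = toy_df[is_drama & is_romance]    # both must hold
drama_or_romance = toy_df[is_drama | is_romance]     # at least one holds
drama_not_romance = toy_df[is_drama & ~is_romance]   # ~ negates a mask

# The inner (\d{4}) is the capture group; the escaped \( \) match literal parentheses
year_text = toy_df["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert strings to numbers; titles without a year become NaN
toy_df["year"] = pd.to_numeric(year_text, errors="coerce")

from_1995 = toy_df[toy_df["year"] == 1995]
print(from_1995)
```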

In [None]:
# a. Animation movies
# your code here

# b. Comedy AND Romance movies
# your code here

# c. Horror OR Sci-Fi movies
# your code here

# d. Drama movies WITHOUT Romance
# your code here

# e. Challenge: Movies from 1995
# your code here

# your code here

# Display info to see the new 'year' column and its type
# your code here

### 1.5: Sorting `movies_df`

a. Sort the `movies_df` by `title` alphabetically (A-Z). Display the head of the sorted DataFrame.
   **Hint:** Use `.sort_values(by='column_name')`.

b. Sort the `movies_df` by `movieId` in descending order. Display the head of this sorted DataFrame.
   **Hint:** Use `ascending=False`.

c. If you created the `year` column in 1.4e, sort the movies by `year` (most recent first), and then by `title` (alphabetically) for movies released in the same year. Display the head.
   **Hint:** Pass a list to `by`: `by=['col1', 'col2']` and a list to `ascending`: `ascending=[False, True]`.
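
A small sketch of single-key and multi-key sorting, using a hypothetical `toy_df`. Note that `sort_values` returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical frame with two rows per year so the tie-breaker matters
toy_df = pd.DataFrame({
    "name": ["delta", "alpha", "gamma", "beta"],
    "year": [2001, 1995, 2001, 1995],
})

# Single key, ascending by default
by_name = toy_df.sort_values(by="name")

# Two keys: newest year first, then name A-Z within each year
by_year_then_name = toy_df.sort_values(
    by=["year", "name"], ascending=[False, True]
)

print(by_name.head())
print(by_year_then_name.head())
```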

In [None]:
# a. Sort by title alphabetically
# your code here

# b. Sort by movieId descending
# your code here

# c. Sort by year (desc) then title (asc)
# your code here

### 1.6: Basic Descriptive Statistics for `movies_df`

a. How many unique genre combinations are there in the `genres` column?
   **Hint:** Use `.nunique()` on the `genres` Series.

b. What are the top 10 most common genre combinations and their counts?
   **Hint:** Use `.value_counts()`.

c. Use the `.describe()` method on `movies_df`. What does it tell you about the `movieId` column? If you created the `year` column, what does it tell you about movie release years?
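
To see what these three methods return before applying them to the real data, here is a sketch on two hypothetical Series, `labels` and `scores`.

```python
import pandas as pd

# Hypothetical categorical and numeric data
labels = pd.Series(["x", "y", "x", "z", "x", "y"])
scores = pd.Series([1.0, 2.0, 3.0, 4.0])

n_unique = labels.nunique()     # number of distinct values
counts = labels.value_counts()  # counts per value, most frequent first
top_two = counts.head(2)        # head(n) keeps the n most common

summary = scores.describe()     # count, mean, std, min, quartiles, max

print(n_unique)
print(top_two)
print(summary)
```

When called on a DataFrame that mixes numeric and text columns, `describe()` summarizes only the numeric ones by default.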

In [None]:
# a. Number of unique genre combinations
# your code here

# b. Top 10 most common genre combinations
# your code here

# c. Describe movies_df
# your code here

---

## Part 2: Exploring `ratings.csv`

### 2.1: Load the `ratings.csv` dataset
Load the `ratings.csv` file into a Pandas DataFrame called `ratings_df`.

**Hint:** The URL is `https://files.grouplens.org/datasets/movielens/ml-latest-small/ratings.csv`

In [None]:
ratings_url = 'https://drive.google.com/uc?export=download&id=15IOZXK7f8nvfxhUXap9VEnZDzBTyu2xI'

# Your code here to load the data into ratings_df


### 2.2: Inspect the `ratings_df` DataFrame

a. How many ratings are in this dataset?

b. What are the data types of each column?

c. Get a statistical summary of the `rating` column. What are the minimum, maximum, and average ratings?
   **Hint:** Use `.describe()` on the 'rating' column or the whole DataFrame.

In [None]:
# a. Number of ratings
# your code here

# b. Data types
print("\nData types:")
# your code here

# c. Statistical summary of 'rating' column
print("\nStatistical summary of 'rating' column:")
# your code here

### 2.3: Filtering `ratings_df`

a. Find and display all ratings given by `userId` 1.

b. Find and display all ratings for `movieId` 318 (The Shawshank Redemption).

c. Find and display all ratings that are exactly 5.0.

d. Find and display all ratings that are less than or equal to 2.0.

e. Find and display all ratings of 4.0 or higher given by `userId` 50.
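
Numeric filters follow the same mask pattern as the text filters in Part 1. The sketch below uses a hypothetical `toy_ratings` frame with made-up column names.

```python
import pandas as pd

# Hypothetical ratings-like frame (not the real ratings.csv)
toy_ratings = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "item": [10, 20, 10, 30, 20],
    "score": [5.0, 2.0, 4.5, 1.5, 4.0],
})

# Equality and comparison operators each produce a boolean mask
user_1 = toy_ratings[toy_ratings["user"] == 1]
low_scores = toy_ratings[toy_ratings["score"] <= 2.0]

# Wrap each condition in parentheses before combining with &
user_2_high = toy_ratings[
    (toy_ratings["user"] == 2) & (toy_ratings["score"] >= 4.0)
]

print(user_2_high)
```

The parentheses matter because `&` binds more tightly than `==` and `>=` in Python.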

In [None]:
# a. Ratings by userId 1
# your code here

# b. Ratings for movieId 318
# your code here

# c. Ratings of 5.0
# your code here

# d. Ratings <= 2.0
# your code here

# e. Ratings by userId 50 that are >= 4.0
# your code here

### 2.4: Sorting `ratings_df`

a. Sort `ratings_df` by the `rating` column in descending order. Display the head.

b. Sort `ratings_df` first by `userId` (ascending) and then by `timestamp` (ascending) for ratings from the same user. Display the head.

In [None]:
# a. Sort by rating descending
# your code here

# b. Sort by userId (asc) then timestamp (asc)
# your code here

### 2.5: Basic Descriptive Statistics for `ratings_df`

a. How many unique users are there in the dataset?
   **Hint:** Use `.nunique()` on `userId`.

b. What is the distribution of ratings? (i.e., how many 0.5, 1.0, 1.5 ... 5.0 ratings are there?)
   **Hint:** Use `.value_counts()` on the `rating` column. Sort the index for better readability if needed (`.sort_index()`).

c. Which user (`userId`) has given the most ratings? How many ratings did they give?
   **Hint:** Use `.value_counts()` on `userId`.
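
One possible way to read a distribution and find the most frequent value, sketched on a hypothetical `toy_ratings` frame. Using `idxmax()` to pick the top user is an assumption of this example; looking at the first row of `value_counts()` would work as well.

```python
import pandas as pd

# Hypothetical data: user 1 appears three times
toy_ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 3],
    "score": [4.0, 0.5, 4.0, 5.0, 4.0, 0.5],
})

# value_counts orders by frequency; sort_index reorders by the score itself
distribution = toy_ratings["score"].value_counts().sort_index()

# Count rows per user, then read off the label and size of the largest count
per_user = toy_ratings["user"].value_counts()
most_active_user = per_user.idxmax()
most_active_count = per_user.max()

print(distribution)
print(most_active_user, most_active_count)
```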

In [None]:
# a. Number of unique users
# your code here

# b. Distribution of ratings
# your code here

# c. User with the most ratings
# your code here

---

## Part 3: Exploring `tags.csv`

### 3.1: Load the `tags.csv` dataset
Load the `tags.csv` file into a Pandas DataFrame called `tags_df`.

**Hint:** The URL is `https://files.grouplens.org/datasets/movielens/ml-latest-small/tags.csv`

In [None]:
tags_url = 'https://drive.google.com/uc?export=download&id=1nsxkCNkl64CBCVqDu2kdqII63i0EMBoS'

# Your code here to load the data into tags_df


### 3.2: Inspect the `tags_df` DataFrame

a. How many tags are in this dataset?

b. What are the data types? Are there any columns you might want to convert later (e.g., 'tag' to lowercase)?

c. How many unique tags are there (case-sensitive for now)?
   **Hint:** `.nunique()` on the `tag` column.

In [None]:
# a. Number of tags
# your code here

# b. Data types
print("\nData types:")
# your code here

# c. Number of unique tags (case-sensitive)
# your code here

### 3.3: Filtering and Manipulating `tags_df`

a. Create a new column `tag_lower` which is the lowercase version of the `tag` column.
   **Hint:** Use `.str.lower()`.

b. Now, how many unique tags are there if we consider them case-insensitively (using `tag_lower`)?

c. Find and display all tags applied by `userId` 2 for `movieId` 60756.

d. Find and display all tags that contain the word 'funny' (case-insensitive search on `tag_lower`).

e. What are the top 10 most frequently applied tags (case-insensitive, using `tag_lower`)?
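
The effect of lowercasing on unique counts and text searches can be seen on a handful of made-up tags. This is only a sketch; `toy_tags` is hypothetical.

```python
import pandas as pd

# Hypothetical tags that differ only by letter case in several rows
toy_tags = pd.DataFrame({
    "tag": ["Funny", "funny", "Dark", "very funny", "dark", "FUNNY"],
})

# .str.lower() applies lowercase to every string in the column
toy_tags["tag_lower"] = toy_tags["tag"].str.lower()

case_sensitive = toy_tags["tag"].nunique()
case_insensitive = toy_tags["tag_lower"].nunique()

# Substring search on the lowercased column matches every casing
has_funny = toy_tags[toy_tags["tag_lower"].str.contains("funny")]

# Most frequent tags once case differences are collapsed
top_tags = toy_tags["tag_lower"].value_counts().head(2)

print(case_sensitive, case_insensitive)
print(has_funny)
print(top_tags)
```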

In [None]:
# a. Create 'tag_lower' column
# your code here

# b. Number of unique tags (case-insensitive)
# your code here

# c. Tags by userId 2 for movieId 60756
# your code here

# d. Tags containing 'funny' (case-insensitive)
# your code here

# e. Top 10 most frequent tags (case-insensitive)
# your code here

---

--- End of Lab ---