# Introduction to Pandas & Recommendation Systems

## Pandas: Data Handling and Manipulation

**Pandas** is a powerful open-source Python library specifically designed for data handling, manipulation, and analysis.

Before using Pandas, you need to import it into your code. The common convention is to import it with the alias `pd`.

In [None]:
import pandas as pd

## Core Pandas Data Structures

Pandas introduces two primary data structures:

1.  **Series:** A one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). Think of it as a single column in a spreadsheet or a labeled list.
2.  **DataFrame:** A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled rows and columns. Think of it as a spreadsheet, an SQL table, or a dictionary of Series objects.
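To make the relationship between the two structures concrete, here is a minimal sketch. The `ratings_demo`, `years_demo`, and `demo_df` names are hypothetical and hold only a couple of illustrative values.

```python
import pandas as pd

# Hypothetical sample data for illustration only
ratings_demo = pd.Series([7.4, 8.0], index=['A', 'B'])
years_demo = pd.Series([2023, 2013], index=['A', 'B'])

# A DataFrame behaves like a dictionary of Series that share one index
demo_df = pd.DataFrame({'avgRating': ratings_demo, 'year': years_demo})

# Selecting a single column gives back a Series
column = demo_df['avgRating']
print(type(column).__name__)
print(demo_df.shape)
```

Each key of the dictionary becomes a column name, and the shared index labels become the row labels.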

## Pandas Series

Let's create a Series to store average movie ratings.

The `pd.Series()` function is used to create a Pandas Series.

### Creating a Series from a List (Default Index)

If you don't specify an index, Pandas creates a default integer index starting from 0.

In [None]:
avgRating_list = [7.4, 8.0, 6.1, 8.8]

avgRating_series_default_index = pd.Series(avgRating_list)

print(avgRating_series_default_index)

### Creating a Series with a Custom Index

Unlike Python lists, which are always addressed by zero-based integer positions, a Pandas Series can carry explicitly defined index labels. This makes the data more meaningful and easier to access.

Let's use `movieId`s as the index for our movie ratings Series.

In [None]:
avgRating_list = [7.4, 8.0, 6.1, 8.8]
movieIds = ['A', 'B', 'C', 'D']

avgRating_series_custom_index = pd.Series(avgRating_list, index=movieIds)

print(avgRating_series_custom_index)

### Accessing Elements in a Series

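You can access elements in a Series using:
*   Its **custom label index** (if defined), via `series[label]` or `series.loc[label]`.
*   Its **integer position** (like a list, starting from 0), via `series.iloc[position]`, even if it has a custom label index. Plain `series[0]` on a label-indexed Series is deprecated in recent Pandas releases, so prefer `.iloc` for positional access.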

In [None]:
# Access using the custom index label 'A'
rating_A = avgRating_series_custom_index['A']
print(f"Rating for movie 'A': {rating_A}")

# Access using the integer position 0 (.iloc is explicitly positional)
rating_pos_0 = avgRating_series_custom_index.iloc[0]
print(f"Rating at position 0: {rating_pos_0}")
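As a short sketch of the two access styles, the explicit `.loc` (label) and `.iloc` (position) accessors remove any ambiguity about which kind of lookup is meant. The `demo_series` name is hypothetical.

```python
import pandas as pd

demo_series = pd.Series([7.4, 8.0, 6.1, 8.8], index=['A', 'B', 'C', 'D'])

by_label = demo_series.loc['B']          # label-based lookup
by_position = demo_series.iloc[1]        # position-based lookup (0-indexed)

label_slice = demo_series.loc['B':'C']   # label slices include both endpoints
position_slice = demo_series.iloc[1:3]   # position slices exclude the stop

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Note the slicing difference: `.loc['B':'C']` includes `'C'`, while `.iloc[1:3]` stops before position 3, so both select the same two rows here.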

### Quiz: Series Access

What will be the output of the following code?
```python
import pandas as pd
list_b = [5, 6, 7, 8]
index_b = ['a', 'b', 'c', 'd']
series_b = pd.Series(list_b, index=index_b)
print(series_b.iloc[0])
```

In [None]:
list_b_quiz = [5, 6, 7, 8]
index_b_quiz = ['a', 'b', 'c', 'd']
series_b_quiz = pd.Series(list_b_quiz, index=index_b_quiz)
print(series_b_quiz.iloc[0])
print("Explanation: Even with a custom label index, a Pandas Series supports integer-based positional access through .iloc.")

## Pandas DataFrame

A DataFrame is a 2D labeled data structure with columns of potentially different types. It's like a table or a spreadsheet.

You can create a DataFrame from various inputs, such as:
*   A Python dictionary of lists (or Series).
*   A list of dictionaries.
*   A NumPy ndarray.
*   Another DataFrame.
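The other construction routes listed above can be sketched as follows. All variable names and values here are illustrative assumptions, and NumPy is imported only for the ndarray case.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
rows = [
    {'movieId': 'A', 'year': 2023},
    {'movieId': 'B', 'year': 2013},
]
df_from_rows = pd.DataFrame(rows)

# From a NumPy ndarray: pass column names explicitly,
# otherwise the columns are labeled 0, 1, ...
array = np.array([[7.4, 224000], [8.0, 664000]])
df_from_array = pd.DataFrame(array, columns=['avgRating', 'numRatings'])

# From another DataFrame
df_from_df = pd.DataFrame(df_from_rows)

print(df_from_rows)
print(df_from_array)
print(df_from_df)
```

Because an ndarray has a single dtype, both columns of `df_from_array` are stored as floats.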

Let's create a DataFrame for movie information (movieId, year, genres).

In [None]:
movies_dict = {
    'movieId': ['A', 'B', 'C', 'D'],
    'year': [2023, 2013, 2023, 2010],
    'genres': ['Adventure|Comedy', 'Romance|Sci-Fi', 'Adventure', 'Adventure|Sci-Fi']
}

movies_df = pd.DataFrame(movies_dict)

print(movies_df)

### Practice: Create a Ratings DataFrame

**Task:** Create a DataFrame named `ratings_df` with columns 'avgRating' and 'numRatings'. Use a list of movie IDs (`['A', 'B', 'C', 'D']`) as the **index** of the DataFrame.

**Data:**
*   avgRating: `[7.4, 8.0, 6.1, 8.8]`
*   numRatings: `[224000, 664000, 70000, 25000000]`
*   movieIds (for index): `['A', 'B', 'C', 'D']`

In [None]:
avgRating_data = [7.4, 8.0, 6.1, 8.8]
numRatings_data = [224000, 664000, 70000, 25000000]
movieIds_index = ['A', 'B', 'C', 'D']

ratings_data_dict = {
    'avgRating': avgRating_data,
    'numRatings': numRatings_data
}

ratings_df = pd.DataFrame(ratings_data_dict, index=movieIds_index)

print(ratings_df)

### Adding a Column from the Index & Resetting Index

Often, the index contains useful data that you might want as a regular column.

**Task:** 
1.  For the `ratings_df` created above (which has 'A', 'B', 'C', 'D' as its index), create a new column named 'movieId' and populate it with the values from the DataFrame's current index.
2.  Then, reset the index of `ratings_df` to be the default numerical index (0, 1, 2, ...). The `reset_index()` method is useful here.

**Quiz Hint (from slides):** `ratings_df['movieId'] = ratings_df.index`

In [None]:
print("Original ratings_df with custom index:")
print(ratings_df)

# 1. Add 'movieId' column from the index
ratings_df['movieId'] = ratings_df.index

print("\nratings_df after adding 'movieId' column:")
print(ratings_df)

# 2. Reset the index. `drop=True` discards the old index instead of inserting it as a new column.
# That is safe here because step 1 already copied the index into 'movieId'.
# Without drop=True, reset_index() would add the old index as a column named 'index'.
ratings_df.reset_index(drop=True, inplace=True)

print("\nratings_df after resetting index:")
print(ratings_df)
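As an alternative to the two steps above, the index can be named and moved into a column in one chained expression. This is a minimal sketch on a hypothetical two-row table (`demo_ratings`), not a replacement for the exercise.

```python
import pandas as pd

demo_ratings = pd.DataFrame(
    {'avgRating': [7.4, 8.0], 'numRatings': [224000, 664000]},
    index=['A', 'B'],
)

# Name the index, then let reset_index() move it into a column of that name
demo_flat = demo_ratings.rename_axis('movieId').reset_index()

print(demo_flat)
print(list(demo_flat.columns))
```

This version returns a new DataFrame and leaves `demo_ratings` unchanged, which avoids `inplace=True`.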

## Combining DataFrames: Join/Merge

Pandas provides powerful ways to combine DataFrames, similar to SQL joins. The `pd.merge()` function is commonly used.

**Task:** Combine `movies_df` and `ratings_df` (assuming `ratings_df` now has a 'movieId' column and a numerical index, and `movies_df` also has a 'movieId' column).

You need to decide:
*   **On which column(s) to join:** Common column(s) that link the two DataFrames (e.g., 'movieId').
*   **How to join:** The type of join (inner, outer, left, right).

**Quiz Question (from slides):** How would you combine them considering the common column `movieId` and an appropriate join type?

In [None]:
# Recreate movies_df (already has movieId)
movies_dict_re = {
    'movieId': ['A', 'B', 'C', 'D'],
    'year': [2023, 2013, 2023, 2010],
    'genres': ['Adventure|Comedy', 'Romance|Sci-Fi', 'Adventure', 'Adventure|Sci-Fi']
}
movies_df_re = pd.DataFrame(movies_dict_re)

# ratings_df from the previous step now has a 'movieId' column and a numeric index.
# Here we build an equivalent table with one extra movie, 'E', to show how the join type matters:
ratings_data_for_merge = {
    'movieId': ['A', 'B', 'C', 'D', 'E'],  # 'E' has no match in movies_df_re, so an inner join drops it
    'avgRating': [7.4, 8.0, 6.1, 8.8, 7.0],
    'numRatings': [224000, 664000, 70000, 25000000, 100000]
}
ratings_df_for_merge_ready = pd.DataFrame(ratings_data_for_merge)

print("movies_df_re:")
print(movies_df_re)
print("ratings_df_for_merge_ready:")
print(ratings_df_for_merge_ready)

# Combine using pd.merge(). An 'inner' join is common if you only want matching movieIds.
# 'on' specifies the column(s) to join on.
merged_df = pd.merge(movies_df_re, ratings_df_for_merge_ready, on='movieId', how='inner')
# Other options for 'how': 'left', 'right', 'outer'

print("Merged DataFrame (inner join on 'movieId'):")
print(merged_df)
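To see how the choice of `how` changes the result, the sketch below runs all four join types on two small hypothetical tables (`movies_demo`, `ratings_demo`) in which movie 'E' has a rating but no movie record.

```python
import pandas as pd

movies_demo = pd.DataFrame({'movieId': ['A', 'B'], 'year': [2023, 2013]})
ratings_demo = pd.DataFrame({'movieId': ['A', 'B', 'E'], 'avgRating': [7.4, 8.0, 7.0]})

# Count the rows each join type produces
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(movies_demo, ratings_demo, on='movieId', how=how)
    row_counts[how] = len(result)

print(row_counts)

# indicator=True adds a '_merge' column showing which table each row came from
outer_demo = pd.merge(movies_demo, ratings_demo, on='movieId', how='outer', indicator=True)
print(outer_demo)
```

The 'inner' and 'left' joins keep only the movies in `movies_demo`, while 'right' and 'outer' also keep 'E', with a missing `year`.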

## Introduction to Recommendation Systems

A **recommendation system** is an algorithm designed to provide users with personalized suggestions, recommendations, or predictions for items (e.g., movies, products, articles).

**Primary Goal:** Assist users in discovering items they are likely to be interested in, based on:
*   Their preferences.
*   Their past behaviors.
*   Historical data of other users.
*   Item characteristics.

Examples: Netflix movie recommendations, Amazon product suggestions, Spotify music playlists, Zillow property suggestions.

## Building a Simple Movie Recommendation System (Conceptual)

Let's use our combined movie data (`merged_df` from the previous step) to think about how we could build a very basic recommendation system.

We can ask the user questions about their preferences and filter the DataFrame accordingly.

### Questions to Guide Recommendations:

**Year-related:**
*   Do you prefer newer movies (e.g., released after a certain year)?

**Rating and Popularity:**
*   Are you more inclined towards movies with higher average ratings, or are you open to exploring movies with slightly lower ratings but potentially exciting plots?
*   Do you prefer highly popular movies (many ratings) or hidden gems (fewer ratings but potentially good)?

**Genre-specific:**
*   What specific genre(s) are you interested in (e.g., Adventure, Comedy, Romance, Sci-Fi)?
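Before asking about genres, it can help to know which ones the data actually contains. The sketch below assumes genres are stored as `|`-separated strings, as in the movie data above; `genres_demo` is a hypothetical stand-in for that column.

```python
import pandas as pd

genres_demo = pd.Series(['Adventure|Comedy', 'Romance|Sci-Fi', 'Adventure', 'Adventure|Sci-Fi'])

# Split each string on '|' into a list, then give every genre its own row
all_genres = genres_demo.str.split('|').explode()

available_genres = sorted(all_genres.unique())
genre_counts = all_genres.value_counts()

print(available_genres)
print(genre_counts)
```

The resulting list could be shown to the user as the set of valid genre choices.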

### [Practice] Movie Recommendation Filtering

Using the `merged_df` (or recreate a similar one if needed), implement filtering based on user preferences.

**Tasks (perform these sequentially or allow user to choose):**

1.  **Filter by Year:** Ask the user if they prefer movies after a certain year (e.g., 2020). If yes, filter the DataFrame. 
2.  **Filter by Average Rating:** Ask the user for a minimum average rating. Filter based on this.
3.  **Filter by Popularity (Number of Ratings):** Ask if they prefer popular movies (e.g., more than 500,000 ratings) or less popular ones. Filter accordingly.
4.  **Filter by Genre:** Ask the user for a genre. Filter movies that contain this genre in their 'genres' string. 

In [None]:
# Recreate a sample merged_df for this exercise
data_rec = {
    'movieId': ['A', 'B', 'C', 'D', 'E', 'F'],
    'year': [2023, 2013, 2023, 2010, 2021, 2019],
    'genres': ['Adventure|Comedy', 'Romance|Sci-Fi', 'Adventure', 'Adventure|Sci-Fi', 'Comedy|Drama', 'Action|Adventure'],
    'avgRating': [7.4, 8.0, 6.1, 8.8, 7.9, 8.5],
    'numRatings': [224000, 664000, 70000, 25000000, 450000, 1200000]
}
merged_df_rec = pd.DataFrame(data_rec)
print("Original DataFrame for Recommendation:")
print(merged_df_rec)

filtered_recommendations = merged_df_rec.copy() # Start with a copy to filter

# 1. Filter by Year
year_pref_input = input("Do you prefer movies after a certain year? (yes/no): ").lower().strip()
if year_pref_input == 'yes':
    try:
        year_cutoff = int(input("Enter the year (e.g., 2020): "))
        filtered_recommendations = filtered_recommendations[filtered_recommendations['year'] > year_cutoff]
    except ValueError:
        print("Invalid year entered.")

# 2. Filter by Average Rating
rating_pref_input = input("Do you have a minimum average rating preference? (yes/no): ").lower().strip()
if rating_pref_input == 'yes':
    try:
        min_rating = float(input("Enter minimum average rating (e.g., 7.5): "))
        filtered_recommendations = filtered_recommendations[filtered_recommendations['avgRating'] >= min_rating]
    except ValueError:
        print("Invalid rating entered.")

# 3. Filter by Popularity (Number of Ratings)
try:
    min_num_ratings = int(input("Minimum number of ratings? (e.g., 500000, or 0 for no preference): "))
    if min_num_ratings > 0:
        filtered_recommendations = filtered_recommendations[filtered_recommendations['numRatings'] >= min_num_ratings]
except ValueError:
    print("Invalid number for ratings count.")

# 4. Filter by Genre
genre_input = input("Enter a genre you are interested in (e.g., Adventure - leave blank for no preference): ").strip()
if genre_input: # If user entered something
    # 'case=False' makes the search case-insensitive
    # 'regex=False' treats the input as plain text, so characters like '(' or '+' cannot break the search
    # 'na=False' treats missing (NaN) genres as non-matching
    filtered_recommendations = filtered_recommendations[filtered_recommendations['genres'].str.contains(genre_input, case=False, regex=False, na=False)]

print("Filtered Movie Recommendations:")
if not filtered_recommendations.empty:
    print(filtered_recommendations)
else:
    print("No movies match your criteria.")
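The same filtering logic can also be packaged as a function that takes preferences as arguments instead of reading them with `input()`, which makes it easier to test and reuse. This is one possible sketch; the `recommend` function and its parameter names are hypothetical.

```python
import pandas as pd

def recommend(df, after_year=None, min_rating=None, min_num_ratings=None, genre=None):
    """Return the rows of df that satisfy every preference that was provided."""
    mask = pd.Series(True, index=df.index)  # start by keeping every row
    if after_year is not None:
        mask &= df['year'] > after_year
    if min_rating is not None:
        mask &= df['avgRating'] >= min_rating
    if min_num_ratings is not None:
        mask &= df['numRatings'] >= min_num_ratings
    if genre:
        # Plain-text, case-insensitive substring match on the genres string
        mask &= df['genres'].str.contains(genre, case=False, regex=False, na=False)
    return df[mask]

sample_df = pd.DataFrame({
    'movieId': ['A', 'B', 'C', 'D', 'E', 'F'],
    'year': [2023, 2013, 2023, 2010, 2021, 2019],
    'genres': ['Adventure|Comedy', 'Romance|Sci-Fi', 'Adventure',
               'Adventure|Sci-Fi', 'Comedy|Drama', 'Action|Adventure'],
    'avgRating': [7.4, 8.0, 6.1, 8.8, 7.9, 8.5],
    'numRatings': [224000, 664000, 70000, 25000000, 450000, 1200000],
})

picks = recommend(sample_df, after_year=2015, min_rating=7.0, genre='adventure')
print(picks['movieId'].tolist())  # ['A', 'F']
```

Combining the conditions into one boolean mask applies all filters in a single pass, and any preference left as `None` is simply skipped.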