# Series Methods and Operations in Pandas
_In this section, you will explore the powerful methods available for working with Pandas Series. You will learn how to apply built-in methods, perform operations on Series, and effectively chain multiple methods to transform and analyze data efficiently._

---

## Contents
1. **Introduction**  
   - Understanding Pandas Series  
   - Differences between Series and DataFrame  
   - Importance of Series Methods in Data Analysis  
   
2. **Key Concepts**  
   - Series 
     - Calling Series Methods  
     - Series Operations  
     - Chaining Series Methods  

3. **Practical Exercises**  
   Hands-on exercises to reinforce learning and apply Series methods in real-world scenarios.

---

## Datasets Used
- [disham993/9000-movies-dataset](https://www.kaggle.com/datasets/disham993/9000-movies-dataset)  

### About Dataset

#### Context
This dataset is designed for building a movie recommender system using Natural Language Processing and Machine Learning. It provides valuable data for learners exploring Data Science concepts.

#### Content
##### Features of the dataset:
- **Release_Date**: Date when the movie was released.
- **Title**: Name of the movie.
- **Overview**: Brief summary of the movie.
- **Popularity**: A metric computed by TMDB based on views, votes, favorites, and more.
- **Vote_Count**: Total number of votes received from viewers.
- **Vote_Average**: Average rating based on vote count and number of viewers (out of 10).
- **Original_Language**: Original language of the movie (dubbed versions are excluded).
- **Genre**: Categories the movie belongs to.
- **Poster_Url**: URL of the movie poster.

#### Acknowledgements
Special thanks to Mr. Nitish Singh from CampusX (https://www.youtube.com/channel/UCCWi3hpnq_Pe03nGxuS7isg) for creating easy-to-follow tutorials.

#### Inspiration
A recommender system can be built using this CSV data.

#### Source
The data was fetched using the API from [The Movie Database](https://developers.themoviedb.org/3/movies/get-popular-movies) and cleaned using Pandas and Numpy libraries in Python.

---

## Author
**Author Name:** Juan Alejandro Carrillo Jaimes  

**Contact:** [jalejandrocjaimes@gmail.com](mailto:jalejandrocjaimes@gmail.com) - [Linkedin-AlejoCJaimes31](https://www.linkedin.com/in/alejocjaimes31/)  

**Purpose:** This content was created as an educational resource for university students.

---


# 1. Introduction  
The **goal** of this chapter is to lay a foundation in pandas by thoroughly inspecting the **Series** data structure. Understanding Series is essential for effective data manipulation in Pandas.

In this chapter, you **will learn** how to apply built-in Series methods, perform operations on Series, and chain multiple methods to transform and analyze data efficiently.

## Summary: Understanding Pandas Series  

## Pandas Series  
<p align="center">
  <img src="https://bites-data.s3.us-east-2.amazonaws.com/series_spreadsheet.png" width="500" height="300"/>
</p>  

A **Pandas Series** is a **one-dimensional labeled array** capable of holding any data type (integers, floats, strings, etc.). Each element in a Series is associated with an **index**, which acts like row labels.

### Why is it important?  
Pandas Series allows for fast, flexible, and intuitive data manipulation. It is a fundamental tool in data science for handling structured data efficiently. 

📌 **Key Characteristics of a Pandas Series:**  
- One-dimensional (like a column in a DataFrame)  
- Supports different data types (int, float, string, datetime, etc.)  
- Indexed automatically (or can be customized)  
- Can perform vectorized operations  

### Code  
```python
import pandas as pd

# Creating a Series with default index
numbers = pd.Series([10, 20, 30, 40, 50])
print(numbers)

# Creating a Series with custom index
grades = pd.Series([90, 85, 88], index=['Alice', 'Bob', 'Charlie'])
print(grades)
```
____  

## Differences Between Series and DataFrame  
<p align="center">
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*RZ1nbLkRCn-8Hu7DK3I4jA.png" width="500" height="300"/>
</p>  

Pandas provides **two primary data structures**:  

| Feature        | Series | DataFrame |
|---------------|--------|-----------|
| **Structure** | 1D Array | 2D Table (Rows & Columns) |
| **Data Type** | Homogeneous (single type) | Heterogeneous (multiple types) |
| **Indexing**  | Single index | Row and Column index |
| **Operations** | Element-wise operations | Complex operations (grouping, merging) |

📌 **Example Comparison:**  
```python
# Series: Single Column
series_example = pd.Series([100, 200, 300])
print(series_example)

# DataFrame: Multiple Columns
df_example = pd.DataFrame({'A': [100, 200, 300], 'B': [10, 20, 30]})
print(df_example)
```
____  

## Importance of Series Methods in Data Analysis  
<p align="center">
  <img src="https://media.licdn.com/dms/image/D5612AQEjQS0vCTKb-g/article-cover_image-shrink_720_1280/0/1693747063213?e=2147483647&v=beta&t=dxnYu_RCztDHDax1cv1Y8iqEowJkkpcI1wCOy9mRtdQ" width="300" height="200"/>
</p>  

Pandas Series methods provide **efficient data manipulation** for:  
✅ **Data Cleaning:** Handling missing values, converting data types  
✅ **Data Transformation:** Applying functions, mapping, filtering  
✅ **Statistical Analysis:** Computing mean, median, sum, standard deviation  
✅ **Text Processing:** Converting case, extracting substrings, pattern matching  

📌 **Example: Applying Series Methods**  
```python
# Creating a Series
data = pd.Series([5, 10, 15, 20, 25, None])

# Using Series methods
print(data.mean())  # Calculates mean (ignores NaN)
print(data.fillna(0))  # Replaces NaN with 0
print(data.astype(str))  # Converts numbers to strings
```
---

### Data Set
I assume you read the notebook `C1-Introduction-To-Pandas-and-DataFrame-Structure.ipynb` in the **Datasets Important Information** section.


# 2. Key concepts

## 2. Series

### 2.1 Calling Series Methods

In [38]:
# import pandas
import pandas as pd
import numpy as np

In [2]:
# read file
path_file = '../datasets/movies-kaggle-df/mymoviedb.csv'
movies = pd.read_csv(path_file, engine='python', on_bad_lines='skip', quotechar='"')

Calling **Series** methods is the primary way to use the abilities that a Series offers. We can use the built-in `dir` function to uncover all the attributes and methods of a **Series**.

In [4]:
set(dir(pd.Series))

{'T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__column_consortium_standard__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__

Pandas `DataFrame` and Pandas `Series` **share** many attributes and methods. As the counts below show, there is a lot of functionality on both of these objects.

In [8]:
series_attr_methods = set(dir(pd.Series))
df_attr_methods = set(dir(pd.DataFrame))
print(f'Total Series methods: {len(series_attr_methods)}\nTotal DataFrame methods: {len(df_attr_methods)}')
print(f'Common methods: {len(series_attr_methods & df_attr_methods)}')

Total Series methods: 419
Total DataFrame methods: 437
Common methods: 362
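
The counts above include private and special (dunder) names. As a rough sketch, the same set operations can be restricted to public names to see what is shared and what is unique to each object. The variable names (`series_public`, `df_public`, and so on) are illustrative only, and the exact counts depend on the installed pandas version.

```python
import pandas as pd

# Keep only public names (those that do not start with an underscore)
series_public = {name for name in dir(pd.Series) if not name.startswith('_')}
df_public = {name for name in dir(pd.DataFrame) if not name.startswith('_')}

# Set operations reveal what is shared and what is unique to each object
shared = series_public & df_public
only_series = series_public - df_public
only_df = df_public - series_public

print(f'Shared public names: {len(shared)}')
print(f'Series only (sample): {sorted(only_series)[:5]}')
print(f'DataFrame only (sample): {sorted(only_df)[:5]}')
```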


In [9]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9837 entries, 0 to 9836
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Release_Date       9837 non-null   object 
 1   Title              9828 non-null   object 
 2   Overview           9828 non-null   object 
 3   Popularity         9827 non-null   float64
 4   Vote_Count         9827 non-null   object 
 5   Vote_Average       9827 non-null   object 
 6   Original_Language  9827 non-null   object 
 7   Genre              9826 non-null   object 
 8   Poster_Url         9826 non-null   object 
dtypes: float64(1), object(8)
memory usage: 691.8+ KB


In [12]:
movies.columns = movies.columns.str.lower().str.replace(' ', '_')

In [13]:
movies_titles = movies['title']
popularity_movies = movies['popularity']

In [16]:
# check dtype series
print(movies_titles.dtype)
print(popularity_movies.dtype)

object
float64


#### Sample
The `sample()` method in Pandas is used to extract a random sample of data from a DataFrame or a Series. You can specify how many rows you want to select, or if you want to perform the selection with or without replacement.

##### Basic Syntax:
```python
DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
```

##### Most Common Parameters:
- **`n`**: Number of elements to select (integer). If `n` is provided, `frac` must be `None`.
- **`frac`**: Proportion of data to select (a floating-point number). If `frac` is provided, `n` must be `None`.
- **`replace`**: If `True`, the sample will be taken with replacement (i.e., selected elements may be selected multiple times). By default, it is `False` (without replacement).
- **`random_state`**: Ensures the sample is reproducible if a specific value is set.
- **`weights`**: Allows specifying different probabilities for selecting each item in the sample.
- **`axis`**: If `0`, selects rows; if `1`, selects columns.
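
The parameters above can be tried on a tiny Series before using them on the movies data. This is a minimal sketch with a made-up Series (`letters` is hypothetical, not part of the dataset).

```python
import pandas as pd

# Hypothetical Series used only to illustrate the parameters
letters = pd.Series(['a', 'b', 'c', 'd', 'e'])

# n: fixed number of elements; random_state makes the draw repeatable
first_draw = letters.sample(n=3, random_state=0)
second_draw = letters.sample(n=3, random_state=0)
print(first_draw.equals(second_draw))  # True: same seed, same sample

# frac: proportion of the data (40% of 5 elements gives 2 elements)
small_draw = letters.sample(frac=0.4, random_state=0)
print(len(small_draw))  # 2

# replace=True allows drawing more elements than the Series holds
oversample = letters.sample(n=8, replace=True, random_state=0)
print(len(oversample))  # 8

# weights: an element with weight 0 is never selected
weighted = letters.sample(n=2, weights=[0, 0, 1, 1, 0], random_state=0)
print(sorted(weighted.tolist()))  # ['c', 'd']
```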

In [17]:
movies_titles.sample(n=8, random_state=42)

9135                                           Steel Rain
4252                              While You Were Sleeping
3662                         Violet Evergarden: The Movie
6454                                           Striptease
518                                              Zootopia
6220    Ghost in the Shell Arise - Border 4: Ghost Sta...
5135                                     The Woman in Red
8078                                        The Two Popes
Name: title, dtype: object

In [19]:
popularity_movies.sample(frac=0.1)

7610     15.802
2627     33.536
453     111.633
1074     62.450
9266     13.889
         ...   
1068     62.667
1499     50.759
5242     20.297
2646     33.379
6044     18.461
Name: popularity, Length: 984, dtype: float64

To calculate frequencies in a Series, use **`value_counts()`**. This method is typically more useful for Series with the `object` data type, but it can occasionally provide insight into `numeric` Series as well. In the second output below, the most frequent value in _popularity_movies_, **14.696**, appears exactly 6 times.

In [20]:
movies_titles.value_counts()

title
Beauty and the Beast                    4
Alice in Wonderland                     4
The Three Musketeers                    3
Black Christmas                         3
The Kid                                 3
                                       ..
Unlawful Entry                          1
Badlands                                1
Violent Delights                        1
The Offering                            1
The United States vs. Billie Holiday    1
Name: count, Length: 9514, dtype: int64

In [21]:
popularity_movies.value_counts()

popularity
14.696    6
13.510    5
16.652    5
14.773    5
14.437    5
         ..
13.358    1
59.425    1
13.356    1
13.355    1
14.978    1
Name: count, Length: 8160, dtype: int64
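
`value_counts()` also accepts a few options that are often handy. The sketch below uses small made-up Series (`languages` and `scores` are hypothetical) to show proportions, missing values, and binning of numeric data.

```python
import pandas as pd

# Hypothetical data used only for illustration
languages = pd.Series(['en', 'en', 'ja', 'en', 'es', None])

# normalize=True returns proportions instead of raw counts
shares = languages.value_counts(normalize=True)
print(shares)

# dropna=False keeps missing values as their own category
with_missing = languages.value_counts(dropna=False)
print(with_missing)

# bins groups a numeric Series into equal-width intervals before counting
scores = pd.Series([1.0, 2.5, 3.0, 7.5, 9.0])
binned = scores.value_counts(bins=3, sort=False)
print(binned)
```

Binning is usually what makes `value_counts()` informative for a continuous column such as popularity, where most exact values occur only once.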

In Pandas, the **`Series`** object exposes several ways to inspect the structure and size of the data. Three of the most common are the **`.size`** and **`.shape`** attributes and Python's built-in **`len()`** function.

- **`.size`**: Returns the total number of elements in the Series.
- **`.shape`**: Returns a tuple representing the shape (dimensions) of the Series (for a Series, it’s always `(n,)` where `n` is the number of elements).
- **`len()`**: A built-in Python function that also returns the number of elements in the Series, similar to `.size`.

In [26]:
popularity_movies.size # size of series

9837

In [27]:
movies_titles.shape

(9837,)

In [28]:
len(movies_titles)

9837

The **`count()`** method returns the number of **non-missing** values. In this case, `popularity_movies` has 9827 non-missing values out of 9837, so **10** values are missing.

In [30]:
popularity_movies.count()

np.int64(9827)
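
The difference between these size-related tools only shows up when data is missing. A minimal sketch, assuming a small made-up Series (`values` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values
values = pd.Series([1.5, np.nan, 3.0, np.nan, 4.5])

print(values.size)     # 5: every element, missing or not
print(len(values))     # 5: same as .size for a Series
print(values.shape)    # (5,)
print(values.count())  # 3: only non-missing values

# The gap between size and count is the number of missing values
print(values.size - values.count() == values.isna().sum())  # True
```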

#### Summary Statistics
The methods **min()**, **max()**, **mean()**, **median()**, **std()** and **quantile()** are often used for basic **statistical analysis** and can be applied to a Series or DataFrame in Pandas.

##### **`min()`**
The **`min()`** method returns the **minimum value** from a Series or DataFrame. It is useful when you want to identify the lowest value in a dataset.

**Basic Syntax**:
```python
Series.min(axis=None, skipna=True, *args, **kwargs)
```

**Most Common Parameters**:
- **`axis`**: A Series has only one axis, so this parameter rarely needs to be set. For a DataFrame, `axis=0` returns the minimum of each column and `axis=1` the minimum of each row.
- **`skipna`**: If `True`, it will exclude `NaN` values. If `False`, it will return `NaN` if any value is missing.

**Math Description**:
The **minimum** is the smallest value in the dataset.

In [31]:
popularity_movies.min()

np.float64(7.1)
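
The effect of `skipna` is easiest to see on a small example. The sketch below uses a made-up Series (`ratings` is hypothetical) and applies equally to `max()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
ratings = pd.Series([7.5, np.nan, 6.0, 9.0])

print(ratings.min())              # 6.0: NaN is skipped by default
print(ratings.max())              # 9.0
print(ratings.min(skipna=False))  # nan: a missing value propagates
```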

##### **`max()`**
The **`max()`** method returns the **maximum value** from a Series or DataFrame. It helps identify the highest value in a dataset.

**Basic Syntax**:
```python
Series.max(axis=None, skipna=True, *args, **kwargs)
```

**Most Common Parameters**:
- **`axis`**: Similar to `min()`, it defaults to `None` for Series, and can be set to `0` or `1` for DataFrames.
- **`skipna`**: If `True`, it excludes `NaN` values during calculation.

**Math Description**:
The **maximum** is the largest value in the dataset.

In [32]:
popularity_movies.max()

np.float64(5083.954)

##### **`mean()`**
The **`mean()`** method calculates the **average** (or arithmetic mean) of the values in a Series or DataFrame.

**Basic Syntax**:
```python
Series.mean(axis=None, skipna=True, *args, **kwargs)
```

**Most Common Parameters**:
- **`axis`**: Defaults to `None` for Series. For DataFrame, you can specify `0` (columns) or `1` (rows).
- **`skipna`**: By default, it is `True`, meaning it will exclude `NaN` values from the calculation.

**Math Description**:
The **mean** is calculated as the sum of all values divided by the number of values:

$\text{Mean} = \frac{\sum x_i}{n}$

Where $x_i$ is each value in the dataset, and $n$ is the total number of values.

In [33]:
popularity_movies.mean()

np.float64(40.32056996031343)

##### **`median()`**
The **`median()`** method returns the **middle value** of a Series or DataFrame when the values are sorted in order. If there is an even number of values, it returns the average of the two middle values.

**Basic Syntax**:
```python
Series.median(axis=None, skipna=True, *args, **kwargs)
```

**Most Common Parameters**:
- **`axis`**: Defaults to `None` for Series. For DataFrames, you can specify `0` (columns) or `1` (rows).
- **`skipna`**: By default, it is `True`, excluding `NaN` values.

**Math Description**:
The **median** is the middle value when the data is sorted. If there is an even number of values, it’s the average of the two middle values.

In [34]:
popularity_movies.median()

np.float64(21.191)
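
Two properties of the median are worth checking by hand: how it handles an even number of values, and how little it reacts to extreme values compared with the mean. A minimal sketch with made-up Series (`odd`, `even`, and `skewed` are hypothetical):

```python
import pandas as pd

odd = pd.Series([3, 1, 2])      # sorted: 1, 2, 3, so the middle value is 2
even = pd.Series([4, 1, 3, 2])  # sorted: 1, 2, 3, 4, so (2 + 3) / 2
print(odd.median())   # 2.0
print(even.median())  # 2.5

# One extreme value moves the mean far more than the median
skewed = pd.Series([10, 12, 14, 16, 1000])
print(skewed.mean())    # 210.4
print(skewed.median())  # 14.0
```

This is the same effect seen in the movies data, where the mean popularity (about 40.3) is well above the median (21.191).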

##### **`std()`**
The **`std()`** method calculates the **standard deviation** of the values in a Series or DataFrame, which is a measure of the amount of variation or dispersion of the data.

**Basic Syntax**:
```python
Series.std(axis=None, skipna=True, ddof=1, *args, **kwargs)
```

**Most Common Parameters**:
- **`axis`**: Defaults to `None`. For DataFrames, `0` means calculating column-wise and `1` row-wise.
- **`skipna`**: Excludes `NaN` values by default.
- **`ddof`**: Delta Degrees of Freedom. The default value is `1`, which gives the sample standard deviation. If set to `0`, it gives the population standard deviation.

**Math Description**:
The **standard deviation** measures the spread of data points around the mean and is calculated as:

$\text{Standard Deviation} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \text{mean})^2}$

Where $x_i$ are the values, and $n$ is the number of values in the dataset.


In [35]:
popularity_movies.std()

np.float64(108.87430773029483)
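
The role of `ddof` can be verified against a manual computation. This is a sketch on made-up numbers (`values` is hypothetical), chosen so that the population standard deviation comes out to a round value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: mean is 5, squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # ddof=1 is the pandas default
population_std = values.std(ddof=0)  # divides by n instead of n - 1

# Manual sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))
n = values.count()
manual = np.sqrt(((values - values.mean()) ** 2).sum() / (n - 1))

print(population_std)                  # 2.0
print(np.isclose(sample_std, manual))  # True

# NumPy's np.std defaults to ddof=0, unlike pandas
print(np.isclose(population_std, np.std(values.to_numpy())))  # True
```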

##### **`quantile()`**  
The **`quantile()`** method calculates the specified **quantile value** of a Series or DataFrame, which helps in understanding the distribution of data. Quantiles divide the data into equal parts, making it useful for statistical analysis and outlier detection.  

**Basic Syntax**:  
```python
Series.quantile(q=0.5, interpolation='linear')
DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
```  

**Most Common Parameters**:  
- **`q`**: A float or list of floats between `0` and `1`, representing the quantile(s) to compute.  
  - `q=0.25` (25th percentile, Q1)  
  - `q=0.50` (50th percentile, median, Q2)  
  - `q=0.75` (75th percentile, Q3)  
- **`axis`**: Determines whether to calculate quantiles along rows (`axis=0`) or columns (`axis=1` in DataFrames).  
- **`numeric_only`**: If `True`, it includes only numeric columns, excluding non-numeric data.  
- **`interpolation`**: Defines how to handle cases where the desired quantile falls between two values (default is `'linear'`).  

**Math Description**:  
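A **quantile** is the value below which a fraction $q$ of the sorted data falls. With the dataset sorted in ascending order as $x_0 \le x_1 \le \dots \le x_{n-1}$ and the default `'linear'` interpolation, the quantile lies at the fractional position $h = q(n-1)$:  

$Q_q = x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor)\,(x_{\lceil h \rceil} - x_{\lfloor h \rfloor})$

where:  
- $q$ is the desired quantile (e.g., 0.25 for Q1).  
- $n$ is the total number of data points.  
- $x$ represents the ordered dataset.  
- $\lfloor h \rfloor$ and $\lceil h \rceil$ are $h$ rounded down and up to the nearest integer position; the other `interpolation` options choose between, or combine, these two neighbors differently.  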
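
As a small check of how the default `'linear'` interpolation behaves when a quantile falls between two data points, the sketch below computes the 25th percentile by hand and compares it with `Series.quantile`. The data (`values`) is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, already sorted, n = 4
values = pd.Series([10.0, 20.0, 30.0, 40.0])
q = 0.25

# Fractional position of the quantile in the sorted data (0-based)
h = q * (len(values) - 1)              # 0.75
lower = values.iloc[int(np.floor(h))]  # 10.0
upper = values.iloc[int(np.ceil(h))]   # 20.0
manual = lower + (h - np.floor(h)) * (upper - lower)

print(manual)              # 17.5
print(values.quantile(q))  # 17.5

# Other interpolation choices pick one of the two neighbors instead
print(values.quantile(q, interpolation='lower'))   # 10.0
print(values.quantile(q, interpolation='higher'))  # 20.0
```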

**Use Case**:  
- Helps in identifying **outliers** using the **Interquartile Range (IQR)** formula:  
  $IQR = Q3 - Q1$

  Any value **below** $Q1 - 1.5 \times IQR$ or **above** $Q3 + 1.5 \times IQR$ is considered an outlier.  
- Commonly used in **data analysis, statistical modeling, and machine learning** for handling skewed distributions.
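
The IQR rule described above can be sketched in a few lines. The Series `popularity_demo` is made-up data, not the movies column, and the 1.5 multiplier is the conventional choice rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical popularity-like values with one extreme entry
popularity_demo = pd.Series([12.0, 14.0, 15.0, 16.0, 18.0, 21.0, 25.0, 400.0])

q1 = popularity_demo.quantile(0.25)
q3 = popularity_demo.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a value is flagged as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (popularity_demo < lower_bound) | (popularity_demo > upper_bound)
print(f'Q1={q1}, Q3={q3}, IQR={iqr}')
print(popularity_demo[is_outlier])
```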

In [39]:
quantiles = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
popularity_movies.quantile(quantiles)

0.1    14.2870
0.2    15.4232
0.3    16.8852
0.4    18.7250
0.5    21.1910
0.6    24.8046
0.7    30.5898
0.8    41.0882
0.9    66.2102
Name: popularity, dtype: float64

##### **What patterns can we see?**  
- **Skewed distribution**:  
  - The difference between the **10th percentile (14.2870)** and the **median (21.1910)** is small (about 6.9).  
  - However, the difference between the **80th percentile (41.0882)** and the **90th percentile (66.2102)** is large (about 25.1).  
  - This indicates that **some movies have extremely high popularity**, suggesting that the distribution is **right-skewed** (a few movies are significantly more popular than most).  

- **Average vs. popular movies**:  
  - **Half of the movies (Q2, the median) have a popularity below 21.1910**, meaning that **a movie with a popularity of 22 is not really "high"**.  
  - The **truly popular movies** are in the **90th percentile (66.2102)**, meaning that a movie with a popularity above ~66 is in the **top 10% of the most viewed, voted, or favorited movies**.

You may use the `.describe()` method to return the summary statistics and a few of the quantiles at once. When `.describe()` is used on an `object` data type column, a completely different output is returned: the count, the number of unique values, the most frequent value, and its frequency.

In [36]:
popularity_movies.describe()

count    9827.000000
mean       40.320570
std       108.874308
min         7.100000
25%        16.127500
50%        21.191000
75%        35.174500
max      5083.954000
Name: popularity, dtype: float64

In [37]:
movies_titles.describe()

count                     9828
unique                    9514
top       Beauty and the Beast
freq                         4
Name: title, dtype: object

In [41]:
popularity_movies.isna().sum()

np.int64(10)

#### **Handling Missing Values in Pandas**  
Missing values are common in real-world datasets. Pandas provides several methods to detect, fill, or remove missing data, ensuring cleaner and more reliable analysis.

---

##### **`.isna()`**  
The **`.isna()`** method checks for missing (`NaN`) values in a Series or DataFrame, returning a boolean mask.  

**Basic Syntax**:  
```python
Series.isna()
DataFrame.isna()
```

**Most Common Use Case**:  
- Helps identify missing values for data cleaning.

In [46]:
print(f"Nulls values in popularity_movies: {popularity_movies.isna().sum()}")

Nulls values in popularity_movies: 10


In [47]:
print(f"Nulls values in movies_titles: {movies_titles.isna().sum()}")

Nulls values in movies_titles: 9


##### **`.fillna()`**  
The **`.fillna()`** method replaces missing values with a specified value or strategy.  

**Basic Syntax**:  
```python
Series.fillna(value, method=None)
DataFrame.fillna(value, method=None, axis=None)
```

**Most Common Parameters**:  
- **`value`**: The value used to replace `NaN` (e.g., `0`, `mean()`, `"unknown"`).  
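- **`method`**: Use `"ffill"` (forward fill) or `"bfill"` (backward fill) to propagate values. This parameter is deprecated since pandas 2.1; the dedicated `.ffill()` and `.bfill()` methods are the recommended replacement.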
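
The filling strategies can be compared side by side on a small example. This is a minimal sketch with made-up data (`readings` is hypothetical), using the dedicated `.ffill()` and `.bfill()` methods for propagation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in the middle and at the end
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Constant value
print(readings.fillna(0).tolist())                # [1.0, 0.0, 0.0, 4.0, 0.0]

# Statistic computed from the non-missing values (mean is 2.5)
print(readings.fillna(readings.mean()).tolist())  # [1.0, 2.5, 2.5, 4.0, 2.5]

# Forward fill: carry the last valid value forward
print(readings.ffill().tolist())                  # [1.0, 1.0, 1.0, 4.0, 4.0]

# Backward fill: the trailing gap has no later value, so it stays NaN
print(readings.bfill().tolist())                  # [1.0, 4.0, 4.0, 4.0, nan]
```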

In [53]:
# Replace the null values with the median
print(f"Nulls values in popularity_movies: {popularity_movies.isna().sum()}")
popularity_movies_filled = popularity_movies.fillna(popularity_movies.median())
print(f"Nulls values in popularity_movies: {popularity_movies_filled.isna().sum()}")

Nulls values in popularity_movies: 0
Nulls values in popularity_movies: 0


In [52]:
print(f"Nulls values in movies_titles: {movies_titles.isna().sum()}")
movies_titles_filled = movies_titles.fillna('Missing Title')
print(f"Nulls values in movies_titles: {movies_titles_filled.isna().sum()}")

Nulls values in movies_titles: 9
Nulls values in movies_titles: 0


##### **`.dropna()`**  
The **`.dropna()`** method removes rows or columns containing missing values.  

**Basic Syntax**:  
```python
Series.dropna()
DataFrame.dropna(axis=0, how='any')
```

**Most Common Parameters**:  
- **`axis=0`**: Drops rows with `NaN` (default).  
- **`axis=1`**: Drops columns with `NaN`.  
- **`how='any'`**: Removes rows/columns with at least one `NaN`.  
- **`how='all'`**: Removes rows/columns only if all values are `NaN`.
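
The `axis` and `how` options only matter for a DataFrame. The sketch below uses a made-up two-column frame (`frame` is hypothetical) to show how each combination behaves.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 3 is entirely missing
frame = pd.DataFrame({
    'title': ['A', 'B', None, None],
    'rating': [7.0, np.nan, 5.0, np.nan],
})

print(len(frame.dropna()))            # 1: rows with any missing value removed
print(len(frame.dropna(how='all')))   # 3: only the fully empty row removed
print(frame.dropna(axis=1).shape)     # (4, 0): every column has a missing value
print(len(frame['rating'].dropna()))  # 2: Series version drops elements
```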

In [54]:
movies_title_dropped = movies_titles.dropna()
movies_title_dropped.size

9828

### 2.2 Series Operations
Python offers a large number of operators for manipulating objects. 

`Series` and `DataFrame` objects support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.
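
Operators also work between two Series, in which case pandas aligns the elements by index label before computing. A minimal sketch with made-up Series (`a` and `b` are hypothetical):

```python
import pandas as pd

# Hypothetical Series that share only the labels 'y' and 'z'
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label, not by position
total = a + b
print(total)

# Labels present in only one Series produce NaN
print(total.isna().sum())  # 2

# The operands are left untouched; a new Series is returned
print(a.tolist())  # [1, 2, 3]
```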

In [72]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9837 entries, 0 to 9836
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   release_date       9837 non-null   object 
 1   title              9828 non-null   object 
 2   overview           9828 non-null   object 
 3   popularity         9827 non-null   float64
 4   vote_count         9827 non-null   object 
 5   vote_average       9827 non-null   object 
 6   original_language  9827 non-null   object 
 7   genre              9826 non-null   object 
 8   poster_url         9826 non-null   object 
dtypes: float64(1), object(8)
memory usage: 691.8+ KB


In [73]:
avg_rating = movies['vote_average']
avg_rating.dtype

dtype('O')

In [74]:
avg_rating = pd.to_numeric(avg_rating, errors='coerce')
avg_rating.dtype

dtype('float64')

In [75]:
print("===ANALYSIS===")
print(f"Null ratings: {avg_rating.isna().sum()}")
print(f"Complete ratings: {avg_rating.count()}")
avg_rating.head()

===ANALYSIS===
Null ratings: 11
Complete ratings: 9826


0    8.3
1    8.1
2    6.3
3    7.7
4    7.0
Name: vote_average, dtype: float64

Use the plus operator (`+`) to add one to each `Series` element.

In [76]:
avg_rating + 1

0       9.3
1       9.1
2       7.3
3       8.7
4       8.0
       ... 
9832    8.6
9833    4.5
9834    6.0
9835    7.7
9836    8.8
Name: vote_average, Length: 9837, dtype: float64

The other basic arithmetic operators, subtraction (`-`), multiplication (`*`), division (`/`), and exponentiation (`**`), work similarly with scalar values.

In [77]:
avg_rating * 0.15

0       1.245
1       1.215
2       0.945
3       1.155
4       1.050
        ...  
9832    1.140
9833    0.525
9834    0.750
9835    1.005
9836    1.170
Name: vote_average, Length: 9837, dtype: float64

In [78]:
avg_rating - 1.5

0       6.8
1       6.6
2       4.8
3       6.2
4       5.5
       ... 
9832    6.1
9833    2.0
9834    3.5
9835    5.2
9836    6.3
Name: vote_average, Length: 9837, dtype: float64

You can use `//` for floor division. The floor division operator rounds the result of the division down to the nearest whole number (toward negative infinity).

In [79]:
avg_rating // 2

0       4.0
1       4.0
2       3.0
3       3.0
4       3.0
       ... 
9832    3.0
9833    1.0
9834    2.0
9835    3.0
9836    3.0
Name: vote_average, Length: 9837, dtype: float64
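
Floor division rounds toward negative infinity, which differs from truncation when values are negative. A small sketch on made-up numbers (`s` is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([7.5, -7.5])

# Floor division: 3.75 goes down to 3.0, -3.75 goes down to -4.0
print((s // 2).tolist())         # [3.0, -4.0]

# Truncation simply drops the fractional part
print(np.trunc(s / 2).tolist())  # [3.0, -3.0]
```

For non-negative data such as the ratings above, both approaches give the same result.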

There are six comparison operators. Applied to a Series, each returns a Boolean Series, which is very useful for filtering data.

Here is a table of the six comparison operators in Python:

| Operator | Description              | Example       | Result  |
|----------|--------------------------|--------------|---------|
| `>`      | Greater than              | `5 > 3`      | `True`  |
| `<`      | Less than                 | `5 < 3`      | `False` |
| `>=`     | Greater than or equal to  | `5 >= 5`     | `True`  |
| `<=`     | Less than or equal to     | `5 <= 3`     | `False` |
| `==`     | Equal to                  | `5 == 5`     | `True`  |
| `!=`     | Not equal to              | `5 != 3`     | `True`  |

In [80]:
avg_rating > 5

0        True
1        True
2        True
3        True
4        True
        ...  
9832     True
9833    False
9834    False
9835     True
9836     True
Name: vote_average, Length: 9837, dtype: bool

In [83]:
avg_rating != 8.3

0       False
1        True
2        True
3        True
4        True
        ...  
9832     True
9833     True
9834     True
9835     True
9836     True
Name: vote_average, Length: 9837, dtype: bool

In [85]:
avg_rating == 7.7

0       False
1       False
2       False
3        True
4       False
        ...  
9832    False
9833    False
9834    False
9835    False
9836    False
Name: vote_average, Length: 9837, dtype: bool
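
The Boolean results above become useful once they are passed back to the Series as a mask. A minimal sketch, assuming a small made-up Series (`ratings` is hypothetical):

```python
import pandas as pd

# Hypothetical ratings labeled by movie
ratings = pd.Series([8.3, 4.5, 6.0, 7.7], index=['A', 'B', 'C', 'D'])

# A comparison produces a Boolean mask aligned with the Series
mask = ratings > 5
print(ratings[mask])  # keeps only the elements where the mask is True

# True counts as 1, so sum() and mean() give a count and a proportion
print(mask.sum())   # 3
print(mask.mean())  # 0.75

# Combine conditions with & (and), | (or), ~ (not); parentheses are required
mid_range = ratings[(ratings > 5) & (ratings < 8)]
print(mid_range.index.tolist())  # ['C', 'D']
```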

The arithmetic and comparison operators used above have method equivalents that produce exactly the same result. The logical operators have no named method counterpart on a Series, so they are used through the operator itself.

| Python Operator | Description                      | Pandas Equivalent        | Operator Group        |
|----------------|----------------------------------|--------------------------|-----------------------|
| `+`           | Addition                         | `.add()`                 | Arithmetic            |
| `-`           | Subtraction                      | `.sub()`                 | Arithmetic            |
| `*`           | Multiplication                   | `.mul()`                 | Arithmetic            |
| `/`           | Division                         | `.div()`                 | Arithmetic            |
| `//`          | Floor Division                   | `.floordiv()`            | Arithmetic            |
| `%`           | Modulus (Remainder)              | `.mod()`                 | Arithmetic            |
| `**`          | Exponentiation                   | `.pow()`                 | Arithmetic            |
| `>`           | Greater than                     | `.gt()`                  | Comparison            |
| `<`           | Less than                        | `.lt()`                  | Comparison            |
| `>=`          | Greater than or equal to         | `.ge()`                  | Comparison            |
| `<=`          | Less than or equal to            | `.le()`                  | Comparison            |
| `==`          | Equal to                         | `.eq()`                  | Comparison            |
| `!=`          | Not equal to                     | `.ne()`                  | Comparison            |
| `&`           | Logical AND (bitwise)            | None (use the operator)  | Logical (Bitwise)     |
| `\|`          | Logical OR (bitwise)             | None (use the operator)  | Logical (Bitwise)     |
| `^`           | Logical XOR (bitwise)            | None (use the operator)  | Logical (Bitwise)     |
| `~`           | Logical NOT (bitwise)            | None (use the operator)  | Logical (Bitwise)     |
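
One practical reason to prefer the method form is that it slots naturally into a chain. A short sketch on made-up data (`s` is hypothetical) that checks the equivalence and then chains three operations:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Operator and method forms give identical results
print(s.add(1).equals(s + 1))  # True
print(s.mul(2).equals(s * 2))  # True
print(s.gt(2).equals(s > 2))   # True

# Methods read left to right in a chain: (s + 1) * 2 > 5
result = s.add(1).mul(2).gt(5)
print(result.tolist())  # [False, True, True]
```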

In [86]:
avg_rating.lt(7.7)

0       False
1       False
2        True
3       False
4        True
        ...  
9832     True
9833     True
9834     True
9835     True
9836    False
Name: vote_average, Length: 9837, dtype: bool

In [87]:
avg_rating - 15

0       -6.7
1       -6.9
2       -8.7
3       -7.3
4       -8.0
        ... 
9832    -7.4
9833   -11.5
9834   -10.0
9835    -8.3
9836    -7.2
Name: vote_average, Length: 9837, dtype: float64

The **`.sub()`** method performs subtraction on a Series. Unlike the `-` operator, it accepts a `fill_value` parameter that is used in place of missing values before the operation is applied.

In [89]:
avg_rating.sub(15, fill_value=0)

0       -6.7
1       -6.9
2       -8.7
3       -7.3
4       -8.0
        ... 
9832    -7.4
9833   -11.5
9834   -10.0
9835    -8.3
9836    -7.2
Name: vote_average, Length: 9837, dtype: float64
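
The head and tail shown above contain no missing values, so the effect of `fill_value` is not visible there. A minimal sketch on made-up data (`s`, `a`, and `b` are hypothetical) makes the difference explicit:

```python
import numpy as np
import pandas as pd

s = pd.Series([8.0, np.nan, 6.0])

# With the operator, a missing value stays missing
print((s - 1).tolist())                 # [7.0, nan, 5.0]

# With fill_value, the missing value is treated as 0 before subtracting
print(s.sub(1, fill_value=0).tolist())  # [7.0, -1.0, 5.0]

# Between two Series, fill_value also covers labels found in only one of them
a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10], index=['y'])
combined = a.add(b, fill_value=0)
print(combined.tolist())                # [1.0, 12.0]
```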

### 2.3 Chaining Series Methods
In Python, every variable points to an object, and many attributes and methods return new objects. This allows methods to be invoked sequentially using attribute access. This is called **method chaining** or **flow programming**.

Although it is possible to write the entire method chain in a single unbroken line, it is far more palatable to write a single method per line.

Personally, I recommend using a backslash (`\`) at the end of each line to continue the statement on the next line for better readability.
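
A common alternative to backslashes is to wrap the whole chain in parentheses, which lets each method sit on its own line without a continuation character. The sketch below shows both styles on a made-up Series (`langs` is hypothetical); which one to use is a matter of preference.

```python
import pandas as pd

# Hypothetical language codes with one missing value
langs = pd.Series(['en', 'ja', 'en', 'es', None, 'en'])

# Style 1: backslash continuation, one method per line
top_backslash = langs.dropna() \
    .str.upper() \
    .value_counts() \
    .idxmax()

# Style 2: parentheses, no continuation characters needed
top_parens = (
    langs
    .dropna()
    .str.upper()
    .value_counts()
    .idxmax()
)

print(top_backslash, top_parens)  # EN EN
```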


In [90]:
language = movies['original_language']

In [93]:
language.value_counts().head()

original_language
en    7569
ja     645
es     339
fr     292
ko     170
Name: count, dtype: int64

In [94]:
# standardize all values to upper case
language_upper = language.str.upper()
language_upper.value_counts().head()

original_language
EN    7569
JA     645
ES     339
FR     292
KO     170
Name: count, dtype: int64

In [104]:
# Select the titles with the most common languages
# EN, ES, FR, RU, JA
# Show the language with more titles.
mask = language_upper.isin(
    ['EN', 'ES', 'FR', 'RU', 'JA']
)

print (language_upper[mask].value_counts() \
    .head())

language_upper[mask].value_counts() \
    .sort_index() \
    .head() \
    .idxmax()   

original_language
EN    7569
JA     645
ES     339
FR     292
RU      83
Name: count, dtype: int64


'EN'

One option for debugging chains is to call the `.pipe` method to show an intermediate value. The `.pipe` method on a `Series` needs to be passed a function that accepts a `Series` as input and can return anything.
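
Because `.pipe` forwards extra arguments to the function, one helper can be reused at several points in a chain. This is a rough sketch: the helper `show_step`, its `label` parameter, and the Series `raw` are all hypothetical names introduced for illustration.

```python
import pandas as pd

def show_step(ser: pd.Series, label: str) -> pd.Series:
    """Print a labeled snapshot and hand the Series back unchanged."""
    print(f'{label}: {len(ser)} rows, {ser.isna().sum()} missing')
    return ser

# Hypothetical raw dates, one of which cannot be parsed
raw = pd.Series(['2021-12-15', 'not a date', '2022-03-01'])

cleaned = (
    pd.to_datetime(raw, errors='coerce', format='%Y-%m-%d')
    .pipe(show_step, label='after parsing')  # 3 rows, 1 missing
    .dropna()
    .pipe(show_step, label='after dropna')   # 2 rows, 0 missing
)
```

Returning the Series unchanged is what keeps the chain intact, so the debugging calls can be removed later without touching the other steps.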

In [107]:
movies.columns

Index(['release_date', 'title', 'overview', 'popularity', 'vote_count',
       'vote_average', 'original_language', 'genre', 'poster_url'],
      dtype='object')

In [120]:
def debug_release_movies(ser:pd.Series) -> pd.Series:
    print("BEFORE")
    print(ser)
    print("AFTER")
    print(ser)
    return ser

In [108]:
release_movies = movies['release_date']

In [122]:
# convert to datetime in format YYYY-MM-DD and fill nulls with 1900-01-01
release_movies = pd.to_datetime(release_movies, 
                              errors='coerce', 
                              format='%Y-%m-%d') \
                .pipe(debug_release_movies) \
                .fillna('1900-01-01') \
                .head()

BEFORE
0   2021-12-15
1   2022-03-01
2   2022-02-25
3   2021-11-24
4   2021-12-22
Name: release_date, dtype: datetime64[ns]
AFTER
0   2021-12-15
1   2022-03-01
2   2022-02-25
3   2021-11-24
4   2021-12-22
Name: release_date, dtype: datetime64[ns]


If you want to store an intermediate value in a global variable, you can also use `.pipe`.

In [123]:
intermediate = None
def get_intermediate(ser:pd.Series) -> pd.Series:
    global intermediate
    intermediate = ser
    return ser

In [124]:
debug_intermediate = pd.to_datetime(release_movies, 
                              errors='coerce', 
                              format='%Y-%m-%d') \
                .pipe(get_intermediate) \
                .fillna('1900-01-01') \
                .head()

In [125]:
debug_intermediate

0   2021-12-15
1   2022-03-01
2   2022-02-25
3   2021-11-24
4   2021-12-22
Name: release_date, dtype: datetime64[ns]

In [126]:
intermediate

0   2021-12-15
1   2022-03-01
2   2022-02-25
3   2021-11-24
4   2021-12-22
Name: release_date, dtype: datetime64[ns]

# 3. Exercises

## Exercise Set: Pandas Series Key Concepts
### **Dataset**: 9000 Movies Dataset  
📌 Make sure you have downloaded the dataset from Kaggle before running the exercises.

---

### **Exercise 1: Calling Series Methods**  
**Objective**: Learn how to call basic Series methods to analyze data.

**Task**:
1. Select the "Popularity" column from the dataset.
2. Use the `.min()` method to find the lowest popularity value in the dataset.
3. Use the `.max()` method to find the highest popularity value in the dataset.
4. Find the mean of the popularity column using the `.mean()` method.
5. Find the median popularity using the `.median()` method.

📌 **Hint**: Use `.min()`, `.max()`, `.mean()`, and `.median()` methods on the 'Popularity' column.


---

### **Exercise 2: Series Operations**  
**Objective**: Perform basic operations on a Series.

**Task**:
1. Create a new Series by adding 10 to the "Popularity" column.
2. Create a new Series by dividing the "Popularity" column by the "Vote_Count" column.
3. Check if any value in the "Popularity" column is greater than 50.

📌 **Hint**: Use arithmetic operations and comparison operators to perform these tasks.

---

### **Exercise 3: Chaining Series Methods**  
**Objective**: Chain multiple Series methods together.

**Task**:
1. First, select the "Popularity" column.
2. Apply the `.dropna()` method to remove any missing values.
3. Apply the `.apply()` method to increase each remaining value by 20%.
4. Calculate the mean of the updated "Popularity" values using the `.mean()` method.

📌 **Hint**: Chain `.dropna()`, `.apply()`, and `.mean()` methods.

---

### **Exercise 4: Handling Missing Values**  
**Objective**: Practice handling missing values in a Series.

**Task**:
1. Identify the missing values in the "Vote_Average" column.
2. Use the `.fillna()` method to replace missing values in "Vote_Average" with the mean of the column.
3. Use the `.dropna()` method to remove rows with missing values in the "Genre" column.

📌 **Hint**: Use `.isna()`, `.fillna()`, and `.dropna()` methods to handle missing data.

---

### **Exercise 5: Applying Custom Functions to a Series**  
**Objective**: Apply custom functions to a Series using `.apply()`.

**Task**:
1. Create a custom function that checks if a movie's popularity is above the average and returns "High" or "Low".
2. Apply this function to the "Popularity" column to create a new column "Popularity_Level" with the results.

📌 **Hint**: Use `.apply()` to apply your custom function to a Series.

---