### Lab 3: CSV Analysis with Python (Netflix Dataset)

**Learning goals:**

* Load and parse CSV files using Python.
* Build custom reusable functions to process tabular data.
* Use generators with `yield` to write memory-efficient code.
* Filter and analyze records from a dataset using conditional logic.
* Create summaries, counters, and frequency tables using `for` loops and dictionaries.
* Perform basic sorting, aggregation, and statistical operations manually.

**Dataset:**

* Download the data from Kaggle:
  * [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)

1. **Load the data in Python as a list of dictionaries.**

Read the CSV data into a list of dictionaries, for later usage.

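```python
import csv

# encoding='utf-8' avoids decode errors on platforms with a different default
with open('C:/Users/markt/OneDrive/Documents/Birkbeck/MSc Data Science/Big Data Analytics/Lab-Exercises/Lab3/netflix_titles.csv', mode='r', newline='', encoding='utf-8') as file:
    reader = csv.DictReader(file)  # maps each row to the header names
    data = [row for row in reader]

print(data[1])  # second record (index 1)
```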

{'show_id': 's2', 'type': 'TV Show', 'title': 'Blood & Water', 'director': '', 'cast': 'Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng', 'country': 'South Africa', 'date_added': 'September 24, 2021', 'release_year': '2021', 'rating': 'TV-MA', 'duration': '2 Seasons', 'listed_in': 'International TV Shows, TV Dramas, TV Mysteries', 'description': 'After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.'}


**Here is the function implementation.**

```python
import csv

def load_data(filename):
    # Read the whole CSV into a list with one dictionary per row
    with open(filename, mode='r', newline='', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        data = [row for row in reader]
    return data

netflix_data = load_data('C:/Users/markt/OneDrive/Documents/Birkbeck/MSc Data Science/Big Data Analytics/Lab-Exercises/Lab3/netflix_titles.csv')
print(netflix_data[:10])  # Print the first 10 entries
```

[{'show_id': 's1', 'type': 'Movie', 'title': 'Dick Johnson Is Dead', 'director': 'Kirsten Johnson', 'cast': '', 'country': 'United States', 'date_added': 'September 25, 2021', 'release_year': '2020', 'rating': 'PG-13', 'duration': '90 min', 'listed_in': 'Documentaries', 'description': 'As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.'}, {'show_id': 's2', 'type': 'TV Show', 'title': 'Blood & Water', 'director': '', 'cast': 'Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng', 'country': 'South Africa', 'date_added': 'September 24, 2021', 'release_year': '2021', 'rating': 'TV-MA', 'duration': '2 Seasons', 'listed_in': 'International TV S

**Time Complexity**: `O(n·m)`

- Opening file: `O(1)`
- Reading file line by line: `O(n)`, where n is the number of rows (excluding the header)
- Parsing each row into a dictionary: `O(n·m)`, where m is the number of columns (fields)
  - Each field is mapped to a key, so constructing a dict is `O(m)`
- List comprehension to store all rows: `O(n)`

**Space Complexity**: `O(n·m)`

* `data` stores all n rows in memory.
* Each row is a dictionary with m key-value pairs.
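
If the `O(n·m)` memory footprint matters, the rows can be streamed instead of stored. The following is a minimal sketch, assuming a hypothetical helper name `iter_rows` (not part of the lab) and a tiny temporary CSV so it runs without the Kaggle file:

```python
import csv
import os
import tempfile


def iter_rows(filename):
    """Yield one row dictionary at a time instead of building a list."""
    with open(filename, mode="r", newline="", encoding="utf-8") as file:
        for row in csv.DictReader(file):
            yield row


# Demo data: a two-row stand-in for netflix_titles.csv (assumption).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    tmp.write("show_id,type,title\n"
              "s1,Movie,Dick Johnson Is Dead\n"
              "s2,TV Show,Blood & Water\n")
    tmp_path = tmp.name

rows = list(iter_rows(tmp_path))  # materialised here only for the demo
os.remove(tmp_path)
print(rows[0]["title"])
```

Time stays `O(n·m)`, but only one row needs to be held at a time as long as the caller does not collect the rows into a list.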


2. **Create a function called `my_head(alist, limit)` to return the first `limit` records of the dataset.**

The version below yields the records one at a time rather than building a new list.


```python
def my_head(alist, limit):
    # Yield at most `limit` records, stopping early if the list is shorter
    for n in range(min(limit, len(alist))):
        yield alist[n]

# Reuse the rows loaded earlier instead of reading the file again
for item in my_head(netflix_data, 10):
    print(item)
```

{'show_id': 's1', 'type': 'Movie', 'title': 'Dick Johnson Is Dead', 'director': 'Kirsten Johnson', 'cast': '', 'country': 'United States', 'date_added': 'September 25, 2021', 'release_year': '2020', 'rating': 'PG-13', 'duration': '90 min', 'listed_in': 'Documentaries', 'description': 'As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.'}
{'show_id': 's2', 'type': 'TV Show', 'title': 'Blood & Water', 'director': '', 'cast': 'Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng', 'country': 'South Africa', 'date_added': 'September 24, 2021', 'release_year': '2021', 'rating': 'TV-MA', 'duration': '2 Seasons', 'listed_in': 'International TV Sho

**Why use `yield`?**

- You get one item at a time (no list built in memory).
- It’s more efficient when you only need a few items from a large list or stream.
- Works well in pipelines or streaming scenarios.

Time complexity: `O(k)`, where `k = min(limit, len(alist))`

Space complexity: `O(1)` extra space, because the generator hands out one item at a time; the input list itself still occupies `O(n·m)`


3. **Create a function called `my_head_col(alist, col, limit)` to return the first `limit` values of a specific column from the dataset as a list.**

```python
def my_head_col(alist, col, limit):
    # Collect the value of column `col` for the first `limit` records
    return_data = []
    for n in range(limit):
        if n < len(alist):
            return_data.append(alist[n][col])
        else:
            break
    return return_data

print(my_head_col(data, "title", 10))
```

['Dick Johnson Is Dead', 'Blood & Water', 'Ganglands', 'Jailbirds New Orleans', 'Kota Factory', 'Midnight Mass', 'My Little Pony: A New Generation', 'Sankofa', 'The Great British Baking Show', 'The Starling']


Time complexity: `O(k)`, where `k = min(limit, len(alist))`; the worst case is `O(n)` when `limit >= len(alist)`

Space complexity: `O(k)` for the returned list (worst case `O(n)`)



---

4. **Filter `titles` added in the year `2021`.**

**Develop a function `shows_added_in_2021(data)` that returns the `titles` added in 2021.**

* **Solution with `return`**

```python

```

Time complexity:

Space complexity: 

* **Solution with `yield`**

```python

```

**Why use `yield`?**

* Time complexity:

* Space complexity: 
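
One possible way to fill in the two blanks above is sketched here. It assumes `date_added` looks like `'September 24, 2021'` (as in the sample output earlier) and may be empty. The generator name `shows_added_in_2021_gen` and the second and third sample rows are hypothetical.

```python
def shows_added_in_2021(data):
    """Return a list of titles whose date_added ends in 2021."""
    result = []
    for row in data:
        # Empty or missing dates simply fail the test below
        if (row.get("date_added") or "").strip().endswith("2021"):
            result.append(row["title"])
    return result


def shows_added_in_2021_gen(data):
    """Yield the same titles lazily, one at a time."""
    for row in data:
        if (row.get("date_added") or "").strip().endswith("2021"):
            yield row["title"]


sample = [
    {"title": "Blood & Water", "date_added": "September 24, 2021"},
    {"title": "Hypothetical Old Title", "date_added": "January 1, 2019"},
    {"title": "Hypothetical Undated Title", "date_added": ""},
]
print(shows_added_in_2021(sample))
print(list(shows_added_in_2021_gen(sample)))
```

Under this sketch both versions scan every row once, so time is `O(n)`. The `return` version holds up to `O(n)` titles, while the generator needs `O(1)` extra space, which helps when the caller stops early or processes titles one by one.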

---

6. **Develop a function that lists `titles` from the United States.**

Lists titles where country is `United States`.

* **Solution with `return`**

```python

```

Complexities are the same as for the `return` solution to the 2021 filter above.

* **Solution with `yield`**

```python

```

**Complexity**

- Time complexity:
- Space complexity: 
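
A minimal sketch of both variants follows. The function names `titles_from_us` and `titles_from_us_gen` are hypothetical, and the sketch assumes an exact match on the `country` field, so co-productions such as `'United States, India'` are not included.

```python
def titles_from_us(data):
    """Return titles whose country field is exactly 'United States'."""
    result = []
    for row in data:
        if (row.get("country") or "").strip() == "United States":
            result.append(row["title"])
    return result


def titles_from_us_gen(data):
    """Generator version: yields matching titles one at a time."""
    for row in data:
        if (row.get("country") or "").strip() == "United States":
            yield row["title"]


sample = [
    {"title": "Dick Johnson Is Dead", "country": "United States"},
    {"title": "Blood & Water", "country": "South Africa"},
    # Hypothetical row showing what an exact match leaves out
    {"title": "Hypothetical Co-Production", "country": "United States, India"},
]
print(titles_from_us(sample))
print(list(titles_from_us_gen(sample)))
```

With this approach the generator takes `O(n)` time and `O(1)` extra space. To include co-productions, the equality test could be swapped for a membership test on the comma-separated country list.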

---

7. **Titles with `love` (any case).**

Searches for titles containing the word `love` (case-insensitive).

* **Solution with `return`**

```python

```

* **Solution with `yield`**

```python

```

Time complexity:

Space complexity: 
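
The search can be sketched as below, assuming a plain substring test on the lower-cased title. The names `titles_with_love` and `titles_with_love_gen` and the sample titles are hypothetical. Note that a substring test also matches words that merely contain `love`, such as "Glove".

```python
def titles_with_love(data):
    """Return titles containing 'love' in any letter case."""
    return [row["title"] for row in data if "love" in row["title"].lower()]


def titles_with_love_gen(data):
    """Generator version of the same filter."""
    for row in data:
        if "love" in row["title"].lower():
            yield row["title"]


sample = [
    {"title": "Hypothetical LOVE Story"},
    {"title": "Blood & Water"},
    {"title": "Hypothetical Glove Maker"},  # matches as a substring
]
print(titles_with_love(sample))
print(list(titles_with_love_gen(sample)))
```

Each title of length `L` is lower-cased and searched once, so the time is roughly `O(n·L)`. Space is `O(n)` for the list version and `O(1)` extra for the generator.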

---

8. **PG-13 Movies**

Finds all movies with a `PG-13` rating.

* **Solution with `return`**

```python

```

* Time complexity:
* Space complexity: 

* **Solution with `yield`**

```python

```

- Time complexity:
- Space complexity: 
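
A possible shape for both solutions is sketched here, assuming the `type` and `rating` fields hold the exact strings `'Movie'` and `'PG-13'` seen in the sample output. The function names and the third sample row are hypothetical.

```python
def pg13_movies(data):
    """Return the rows that are movies rated PG-13."""
    result = []
    for row in data:
        if row.get("type") == "Movie" and row.get("rating") == "PG-13":
            result.append(row)
    return result


def pg13_movies_gen(data):
    """Generator version: yields matching rows lazily."""
    for row in data:
        if row.get("type") == "Movie" and row.get("rating") == "PG-13":
            yield row


sample = [
    {"title": "Dick Johnson Is Dead", "type": "Movie", "rating": "PG-13"},
    {"title": "Blood & Water", "type": "TV Show", "rating": "TV-MA"},
    {"title": "Hypothetical R Movie", "type": "Movie", "rating": "R"},
]
print([row["title"] for row in pg13_movies(sample)])
```

Both versions take `O(n)` time under this sketch. The list version uses up to `O(n)` space for the result, and the generator uses `O(1)` extra space.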

---

9. **Develop the `my_len` function, to count the total entries**

Counts the number of rows in the dataset.

```python

```

**Can I use yield?**

?

- Time complexity:
- Space complexity: 
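
A minimal sketch of `my_len` that counts with a loop instead of calling `len` is shown below. On the `yield` question, one reasonable reading is that a count is a single value, so there is nothing to yield step by step; the function can still accept a generator as its input.

```python
def my_len(rows):
    """Count the items in any iterable without using len()."""
    count = 0
    for _ in rows:
        count += 1
    return count


print(my_len(["a", "b", "c"]))
# Works on generators too, which have no len()
print(my_len(x for x in range(5)))
```

This sketch runs in `O(n)` time and `O(1)` extra space.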

---

10. **Count Types**

Counts how many entries are `TV Show` vs. `Movie`.

```python

```

- Time complexity:
- Space complexity: 
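
The count can be sketched with two fixed counters, assuming only the values `'Movie'` and `'TV Show'` are of interest. The name `count_types` and the third sample row are hypothetical.

```python
def count_types(data):
    """Count how many rows are movies and how many are TV shows."""
    counts = {"Movie": 0, "TV Show": 0}
    for row in data:
        kind = row.get("type")
        if kind in counts:  # ignore unexpected or missing values
            counts[kind] += 1
    return counts


sample = [
    {"title": "Dick Johnson Is Dead", "type": "Movie"},
    {"title": "Blood & Water", "type": "TV Show"},
    {"title": "Hypothetical Second Movie", "type": "Movie"},
]
print(count_types(sample))
```

One pass over the rows gives `O(n)` time, and the two counters take `O(1)` space.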

---

11. **Count Per Category**

Generates a frequency table: a dictionary that maps each category to its number of titles.

```python

```

**Time Complexity: `O(n)`**

- The function iterates over all `n` rows once.
- Dictionary operations (`in`, `+= 1`, assignment) are on average **O(1)**.
- So the total time is **O(n)**.

**Space Complexity: `O(t)`**, where `t` is the number of unique content types

- A dictionary `type_counts` is built with one entry per unique type (e.g., "Movie", "TV Show", etc.).
- In practice, `t` is small, so this is often treated as **O(1)**.
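
The analysis above fits a function along the following lines. This is a sketch: the function name `count_per_category` and its `col` parameter are assumptions, while `type_counts` follows the name used in the analysis.

```python
def count_per_category(data, col="type"):
    """Build a frequency table for the values found in column `col`."""
    type_counts = {}
    for row in data:
        key = row[col]
        if key in type_counts:      # average O(1) lookup
            type_counts[key] += 1
        else:
            type_counts[key] = 1    # first time this value is seen
    return type_counts


sample = [
    {"title": "Dick Johnson Is Dead", "type": "Movie"},
    {"title": "Blood & Water", "type": "TV Show"},
    {"title": "Hypothetical Second Movie", "type": "Movie"},
]
print(count_per_category(sample))
```

Passing a different column name, for example `col="rating"`, would produce a frequency table for that column with the same complexity.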

---

12. **Average TV show seasons**

Calculates average number of seasons for TV shows.

```python

```

- Time complexity:
- Space complexity: 
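
The average can be sketched as follows, assuming TV show durations look like `'2 Seasons'` or `'1 Season'` as in the sample output. The name `average_seasons` and the `'1 Season'` sample row are hypothetical; rows with a missing or malformed duration are skipped.

```python
def average_seasons(data):
    """Return the mean number of seasons across TV shows (0.0 if none)."""
    total = 0
    count = 0
    for row in data:
        if row.get("type") != "TV Show":
            continue
        parts = (row.get("duration") or "").split()
        # Expect exactly '<number> Season' or '<number> Seasons'
        if len(parts) == 2 and parts[0].isdigit() and parts[1].startswith("Season"):
            total += int(parts[0])
            count += 1
    return total / count if count else 0.0


sample = [
    {"type": "TV Show", "duration": "2 Seasons"},
    {"type": "TV Show", "duration": "1 Season"},
    {"type": "Movie", "duration": "90 min"},
]
print(average_seasons(sample))
```

Only a running total and a counter are kept, so this sketch takes `O(n)` time and `O(1)` extra space.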

---

13. **Sort by release year using bubble sort**

```python

```

**How Bubble Sort Works:**

- Repeatedly compares adjacent elements
- Swaps them if they're in the wrong order
- "Bubbles" the largest value to the end in each pass

Time complexity: `O(n^2)` in the worst and average case

Space complexity: `O(n)` because the list is copied before sorting; the swaps themselves need only `O(1)` extra space
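
The steps above can be sketched as follows. The function name `bubble_sort_by_year` is hypothetical, `release_year` is converted to `int` because the CSV reader returns strings, and the early exit on a pass without swaps is an optional optimisation.

```python
def bubble_sort_by_year(data):
    """Return a new list of rows sorted by release_year (ascending)."""
    rows = list(data)  # shallow copy so the input order is preserved
    n = len(rows)
    for i in range(n - 1):
        swapped = False
        # After pass i, the last i items are already in place
        for j in range(n - 1 - i):
            if int(rows[j]["release_year"]) > int(rows[j + 1]["release_year"]):
                rows[j], rows[j + 1] = rows[j + 1], rows[j]
                swapped = True
        if not swapped:  # no swaps means the list is sorted
            break
    return rows


sample = [
    {"title": "Blood & Water", "release_year": "2021"},
    {"title": "Dick Johnson Is Dead", "release_year": "2020"},
    {"title": "Hypothetical Older Title", "release_year": "1993"},
]
print([row["release_year"] for row in bubble_sort_by_year(sample)])
```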

---

14. **Convert durations**

The function extracts numeric values from the `"duration"` field and groups them into a dictionary based on units like `"min"`, `"Season"`, or `"Seasons"`. It skips empty or malformed entries.

*It works on a list of dictionaries where each dictionary has a `'duration'` key.*

```python

```

**Output**

```python
{
    "min": [90, 91, 125, 104, 127, 91,...],
    "Seasons": [2, 2, 9,..],
    "Season": [1, 1, 1, 1, 1,...]
}
```
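
A function that produces output of this shape could look like the sketch below. The name `convert_durations` is hypothetical, and it assumes every valid duration has the form `'<number> <unit>'`.

```python
def convert_durations(data):
    """Group numeric duration values by their unit."""
    grouped = {}
    for row in data:
        parts = (row.get("duration") or "").split()
        # Skip empty or malformed entries such as '' or 'unknown'
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        value, unit = int(parts[0]), parts[1]
        if unit not in grouped:
            grouped[unit] = []
        grouped[unit].append(value)
    return grouped


sample = [
    {"duration": "90 min"},
    {"duration": "2 Seasons"},
    {"duration": "1 Season"},
    {"duration": ""},         # empty: skipped
    {"duration": "unknown"},  # malformed: skipped
]
print(convert_durations(sample))
```

This sketch makes one pass over the rows, so it takes `O(n)` time, and it stores every valid value, so it uses `O(n)` space.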

---

15. **What is the distribution of content types (TV Show vs Movie)?**

Create a **bar chart** showing how many titles fall into each type.

```python

```

![Bar chart](bar-chart.png)
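
A chart like the one above could be produced along these lines. This is a sketch that assumes `matplotlib` is installed; the function names and axis labels are hypothetical, and the file name `bar-chart.png` follows the image reference above.

```python
def count_types(data):
    """Count how many titles fall into each content type."""
    counts = {}
    for row in data:
        counts[row["type"]] = counts.get(row["type"], 0) + 1
    return counts


def plot_type_counts(counts, path="bar-chart.png"):
    """Draw one bar per content type and save the figure to `path`."""
    # Imported here so count_types works even without matplotlib
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("Type")
    ax.set_ylabel("Number of titles")
    ax.set_title("Distribution of content types")
    fig.savefig(path)
    return fig


sample = [
    {"title": "Dick Johnson Is Dead", "type": "Movie"},
    {"title": "Blood & Water", "type": "TV Show"},
    {"title": "Hypothetical Second Movie", "type": "Movie"},
]
counts = count_types(sample)
print(counts)
```

Calling `plot_type_counts(counts)` then writes the chart to disk; with the full dataset, `counts` would come from `count_types(netflix_data)`.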