# Data Collection and Management with pandas – Notebook

## 📘 Introduction

In this notebook, you'll apply essential data collection and management skills using **pandas**, one of the most widely used Python libraries in data science.

This guide mirrors the structure and tasks outlined in your activity document, but now with working examples, explanations, and notes to help you understand the *why* behind each action.

You will:
- Use pandas to import structured and semi-structured data (CSV and JSON)
- Work with APIs to collect real-world data
- Inspect, clean, and transform datasets
- Save/export datasets for further use
- Combine data from multiple sources
- Build a reusable Python function to automate data collection

Before starting, ensure you have `pandas` and `requests` installed:
```bash
pip install pandas requests
```

In [1]:
import pandas as pd

## 📅 Task 1: Load and Inspect a Local CSV File

In this task, we’ll work with a CSV file containing Netflix movies and TV shows. The goal is to:
- Load the data
- Inspect its structure
- Rename one unclear column and drop one we don't need

In [2]:
# Load the Netflix dataset (replace with your actual path)
df_netflix = pd.read_csv("../data/netflix_titles.csv")

# Preview the data
print("\nFirst 5 rows:")
df_netflix.head()


First 5 rows:


,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


> `head()` shows us the first few rows of the dataset. This is useful to quickly understand the structure of the table, spot messy entries, and verify it loaded correctly.
>
> For example, we can see that columns like `title`, `director`, and `cast` are mostly strings, but there might be missing values or inconsistent formats in `date_added` and `country`.

In [3]:
# Summary info on structure and data types
print("\nDataFrame info:")
df_netflix.info()


DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


> `info()` tells us the number of entries (rows), how many are non-null (useful to detect missing data), and the data types of each column.
>
> From this output, we can see that `director` (6,173 of 8,807 non-null), `cast`, and `country` have the most missing values, while `date_added`, `rating`, and `duration` are missing only a handful. Gaps like these could affect future analysis or modeling.
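
To make that comparison concrete, missing values can be counted per column. The sketch below uses a small hand-made DataFrame (the rows are illustrative, not taken from the file) so it runs without the CSV; on the real data you would call the same methods on `df_netflix`.

```python
import pandas as pd

# Tiny illustrative sample; the values are made up for this example
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG-13", "TV-MA", None, "TV-MA"],
})

# Count missing values per column, largest first
missing_counts = sample.isna().sum().sort_values(ascending=False)

# The same information as a share of all rows
missing_share = sample.isna().mean().round(2)

print(missing_counts)
print(missing_share)
```

Sorting the counts puts the columns that need the most attention at the top, which is handy when a dataset has dozens of columns.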

In [4]:
print("\nData types:")
print(df_netflix.dtypes)


Data types:
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


> `dtypes` confirms the pandas data type of each column, which is essential for deciding how to clean and process data later (e.g., strings vs. dates). Here `date_added` is stored as `object` (string), so we would convert it to a datetime type before sorting or filtering by date.
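
As a minimal sketch of that conversion, the example below parses a few hand-written strings in the same style as `date_added`. The sample values (including the stray leading space and the missing entry) are assumptions chosen to show typical problems, not rows copied from the dataset.

```python
import pandas as pd

# Illustrative strings shaped like `date_added`; the stray space and None are deliberate
dates = pd.Series(["September 25, 2021", " September 24, 2021", None])

# Strip whitespace, then parse with an explicit format;
# entries that cannot be parsed (or are missing) become NaT
parsed = pd.to_datetime(dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed.dtype)
print(parsed.dt.year)
```

With `errors="coerce"`, bad entries turn into `NaT` instead of raising an error, so it is worth counting them afterwards with `parsed.isna().sum()`.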

In [5]:
# Rename unclear column for clarity
df_netflix = df_netflix.rename(columns={"listed_in": "genres"})

> The original column `listed_in` was vague — renaming it to `genres` makes the dataset clearer and easier to work with.

In [6]:
# Drop a column that won't help with analysis
df_netflix = df_netflix.drop(columns=["show_id"])

> The `show_id` field is just a unique identifier and doesn’t contribute to insights or analysis, so we remove it to simplify the dataset.

🧠 **Check Your Understanding:**
- Which columns might contain missing or messy data?
- What data types are most common in this dataset?
- Are there any fields you'd want to convert or clean before analysis?

> We can already identify that `director` and `cast` will need attention due to frequent null values. The majority of columns are textual (object type), so string operations and cleaning will be important in upcoming steps.

## 🌐 Task 2: Load JSON from a Public API

Here we’ll fetch JSON data from the **REST Countries API**, flatten the nested structure, and extract useful geographic fields.

In [7]:
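import requests

# Fetch data from REST Countries API
url = "https://restcountries.com/v3.1/all"
response = requests.get(url, timeout=30)  # fail instead of hanging if the server is unresponsive
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error body
data = response.json()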

> This uses the `requests` library to retrieve live data from a public API. The returned JSON is nested, so we need to process it before analysis.
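
Because live APIs can be slow, unavailable, or return an error status, it can help to wrap the request in a small helper. The function below is a sketch, not part of the original activity: `fetch_json` is a hypothetical name, and the default timeout of 10 seconds is an arbitrary choice.

```python
import requests


def fetch_json(url, params=None, timeout=10):
    """Send a GET request and return the decoded JSON body.

    Raises requests.HTTPError for 4xx/5xx responses and
    requests.Timeout if the server takes longer than `timeout` seconds.
    """
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # turn HTTP error statuses into exceptions
    return response.json()
```

A call might look like `fetch_json("https://restcountries.com/v3.1/all", params={"fields": "name,capital,population,region,area"})`. This assumes the API accepts a `fields` query parameter to limit the response, as its documentation describes; check the current documentation before relying on it.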

In [8]:
# Convert nested JSON to flat table
df_countries = pd.json_normalize(data)
df_countries.head()

,tld,cca2,ccn3,cca3,independent,status,unMember,capital,altSpellings,region,...,name.nativeName.heb.common,languages.heb,name.nativeName.mri.official,name.nativeName.mri.common,name.nativeName.nzs.official,name.nativeName.nzs.common,languages.mri,languages.nzs,currencies.NIO.name,currencies.NIO.symbol
0,[.gs],GS,239,SGS,False,officially-assigned,False,[King Edward Point],"[GS, South Georgia and the South Sandwich Isla...",Antarctic,...,,,,,,,,,,
1,[.gd],GD,308,GRD,True,officially-assigned,True,[St. George's],[GD],Americas,...,,,,,,,,,,
2,[.ch],CH,756,CHE,True,officially-assigned,True,[Bern],"[CH, Swiss Confederation, Schweiz, Suisse, Svi...",Europe,...,,,,,,,,,,
3,[.sl],SL,694,SLE,True,officially-assigned,True,[Freetown],"[SL, Republic of Sierra Leone]",Africa,...,,,,,,,,,,
4,[.hu],HU,348,HUN,True,officially-assigned,True,[Budapest],[HU],Europe,...,,,,,,,,,,


> `json_normalize` flattens the nested JSON structure, converting it into a readable, tabular format. This helps us easily select fields like population or region without needing to loop through dictionaries.
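
To see the flattening on a small scale, the sketch below applies `json_normalize` to two hand-written records. They are shaped like a simplified API response but are assumptions made for the example, not actual API output.

```python
import pandas as pd

# Hand-written records shaped like a simplified API response (not real API output)
records = [
    {"name": {"common": "Grenada", "official": "Grenada"}, "region": "Americas"},
    {"name": {"common": "Hungary", "official": "Hungary"}, "region": "Europe"},
]

flat = pd.json_normalize(records)

# Nested keys become dotted column names such as "name.common"
print(sorted(flat.columns))
print(flat["name.common"].tolist())
```

This is where column names like `name.common` come from: each level of nesting is joined to its parent key with a dot.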

In [9]:
# Select relevant fields
df_countries_clean = df_countries[[
    "name.common",
    "capital",
    "population",
    "region",
    "area"
]].copy()

> We isolate only the fields we care about — these will later be useful for geographic comparison and merging.

In [10]:
# Rename for consistency
df_countries_clean = df_countries_clean.rename(columns={
    "name.common": "country_name"
})

> A clear and consistent column name like `country_name` makes future joins and filters more intuitive.

In [11]:
# Save the cleaned version in the data folder
df_countries_clean.to_csv("../data/countries_clean.csv", index=False)

> Exporting the cleaned dataset ensures it can be reused later without calling the API again.

In [12]:
# Reload the saved cleaned data
df_countries_clean = pd.read_csv("../data/countries_clean.csv")
df_countries_clean.head()

,country_name,capital,population,region,area
0,South Georgia,['King Edward Point'],30,Antarctic,3903.0
1,Grenada,"[""St. George's""]",112519,Americas,344.0
2,Switzerland,['Bern'],8654622,Europe,41284.0
3,Sierra Leone,['Freetown'],7976985,Africa,71740.0
4,Hungary,['Budapest'],9749763,Europe,93028.0
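
In the reloaded preview, `capital` appears as text such as `['Bern']` rather than a Python list, because a CSV file stores every cell as text. One possible way to recover the lists is `ast.literal_eval`, sketched below on a tiny in-memory round trip. The helper name `parse_list_cell` is hypothetical, and the sample rows are illustrative.

```python
import ast
import io

import pandas as pd

# Small frame with a list-valued column, similar in shape to `capital`
original = pd.DataFrame({
    "country_name": ["Grenada", "Switzerland"],
    "capital": [["St. George's"], ["Bern"]],
})

# Round trip through CSV in memory instead of writing a file
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)


def parse_list_cell(value):
    """Turn the text form of a list back into a list; leave non-strings (e.g. NaN) alone."""
    return ast.literal_eval(value) if isinstance(value, str) else value


reloaded["capital"] = reloaded["capital"].apply(parse_list_cell)
print(reloaded["capital"].tolist())
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.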


🧠 **Check Your Understanding:**
- What do the fields `population` and `area` allow us to analyze?
- How is this dataset different in structure compared to the Netflix one?

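> With this dataset, we can analyze country-level metrics like population size, land area, and regional classification. Unlike the Netflix dataset, which arrives as a flat CSV of mostly text columns, this one starts as nested JSON and includes numeric fields (`population`, `area`). Note that `capital` holds a list, which the CSV round trip turns into a string such as `['Bern']`.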

---

## 🏓 Task 3: Combine Data from Multiple Sources

Now let’s connect the Netflix dataset with geographic country data.

In [13]:
# Fill missing values in the 'country' column
df_netflix["country"] = df_netflix["country"].fillna("Unknown")

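> We replace null country values with an explicit `"Unknown"` label so that missing values stay visible and are handled consistently in the string operations and merge that follow. Rows labeled `Unknown` will not match any country in the merge, so their geographic fields remain empty.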

In [14]:
# Extract the first listed country only (some entries have multiple)
df_netflix["country_main"] = df_netflix["country"].str.split(",").str[0].str.strip()

> Some rows have multiple countries listed — we simplify by keeping just the first country. This makes merging with external country data more reliable.

In [15]:
# Merge Netflix with country info
df_merged = pd.merge(df_netflix, df_countries_clean, how="left",
                     left_on="country_main", right_on="country_name")
df_merged.head()

,type,title,director,cast,country,date_added,release_year,rating,duration,genres,description,country_main,country_name,capital,population,region,area
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",United States,United States,"['Washington, D.C.']",329484100.0,Americas,9372610.0
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",South Africa,South Africa,"['Pretoria', 'Bloemfontein', 'Cape Town']",59308690.0,Africa,1221037.0
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,Unknown,,,,,
3,TV Show,Jailbirds New Orleans,,,Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",Unknown,,,,,
4,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,India,India,['New Delhi'],1380004000.0,Asia,3287590.0


> Merging brings geographic data into our entertainment dataset. This gives us new angles for analysis like regional content trends.
>
> For example, we can now compare how many Netflix titles exist per region or examine average population size of countries producing Netflix content.
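
The sketch below illustrates both ideas on toy data: counting titles per region, and checking which rows found no match in a left merge. The frames and their values are made up for the example; `indicator=True` is a standard `merge` option that adds a `_merge` column describing where each row came from.

```python
import pandas as pd

# Toy stand-ins for the Netflix and country tables (values are illustrative)
titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country_main": ["India", "India", "Unknown", "France"],
})
countries = pd.DataFrame({
    "country_name": ["India", "France"],
    "region": ["Asia", "Europe"],
})

# indicator=True adds a `_merge` column: "both" or "left_only" for a left merge
merged = titles.merge(
    countries, how="left",
    left_on="country_main", right_on="country_name",
    indicator=True,
)

# Rows that found no matching country
unmatched = merged[merged["_merge"] == "left_only"]

# Titles per region, keeping unmatched rows visible as NaN
titles_per_region = merged["region"].value_counts(dropna=False)

print(unmatched[["title", "country_main"]])
print(titles_per_region)
```

Inspecting the unmatched rows is a quick way to spot spelling differences between the two sources, not just the `Unknown` placeholders.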

In [16]:
# Save the enriched dataset
df_merged.to_json("../data/netflix_enriched.json", orient="records")

> Saving the final output makes it portable for dashboards or modeling.

In [17]:
# Reload the merged dataset
df_netflix_country = pd.read_json("../data/netflix_enriched.json")
df_netflix_country.head()

,type,title,director,cast,country,date_added,release_year,rating,duration,genres,description,country_main,country_name,capital,population,region,area
0,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",United States,United States,"['Washington, D.C.']",329484100.0,Americas,9372610.0
1,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",South Africa,South Africa,"['Pretoria', 'Bloemfontein', 'Cape Town']",59308690.0,Africa,1221037.0
2,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,Unknown,,,,,
3,TV Show,Jailbirds New Orleans,,,Unknown,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",Unknown,,,,,
4,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,India,India,['New Delhi'],1380004000.0,Asia,3287590.0


🧠 **Check Your Understanding:**
- What new insights could this merged dataset unlock?
- What limitations do we face by keeping only the first country?

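> Thanks to this merge, we now have additional fields like population, region, and area connected to each Netflix entry. This unlocks potential for questions like: "Are larger countries producing more Netflix content?" or "Which region is most represented on the platform?" The main limitation is that co-productions are attributed only to their first listed country, so other contributing countries are undercounted, and titles labeled `Unknown` carry no geographic data at all.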
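
If keeping only the first country is too lossy for a given question, one alternative is to give each listed country its own row with `explode`. This is a sketch of that alternative on toy data, not the approach used in this notebook; the column name `country_single` is hypothetical.

```python
import pandas as pd

# Toy data with one multi-country entry (values are illustrative)
df = pd.DataFrame({
    "title": ["A", "B"],
    "country": ["United States, India", "France"],
})

# Split the comma-separated string into a list, then create one row per list element
exploded = df.assign(country_single=df["country"].str.split(",")).explode(
    "country_single", ignore_index=True
)

# Remove the space left over after each comma
exploded["country_single"] = exploded["country_single"].str.strip()

print(exploded[["title", "country_single"]])
```

The trade-off is that a title now appears once per country, so counts of titles must use `nunique()` on the title column rather than the number of rows.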

---

## 💡 Integrating Challenge: Build a Reusable Data Collector

In this final task, we combine Python fundamentals with pandas: build a function that pulls JSON data from an API, processes it, and saves it in structured formats.


In [18]:
def collect_and_save_data(api_url, csv_name="output.csv", json_name="output.json"):
    """
    Fetch JSON data from an API, convert it to a DataFrame,
    rename the first two columns to "col1" and "col2",
    and save the result as CSV and JSON in ../data/.

    Parameters:
    - api_url: the API endpoint to request
    - csv_name: name for the output CSV file
    - json_name: name for the output JSON file

    Returns the DataFrame, or None if any step fails.
    """
    try:
        response = requests.get(api_url, timeout=30)  # avoid hanging on a slow server
        response.raise_for_status()  # raise on 4xx/5xx responses
        data = response.json()
        df = pd.json_normalize(data)

        # Give the first two columns generic names (only if there are at least two)
        if len(df.columns) >= 2:
            df = df.rename(columns={df.columns[0]: "col1", df.columns[1]: "col2"})

        df.to_csv(f"../data/{csv_name}", index=False)
        df.to_json(f"../data/{json_name}", orient="records")
        print(f"✅ Data saved to ../data/{csv_name} and ../data/{json_name}")
        return df

    except Exception as e:
        print("❌ An error occurred:", e)
        return None

# Example usage with a stable and open API:
df_jokes = collect_and_save_data("https://official-joke-api.appspot.com/jokes/ten", "jokes.csv", "jokes.json")

✅ Data saved to ../data/jokes.csv and ../data/jokes.json


In [19]:
# Reload and preview the jokes dataset
df_loaded_jokes = pd.read_csv("../data/jokes.csv")
print("\nLoaded jokes data:")
df_loaded_jokes.head()


Loaded jokes data:


,col1,col2,punchline,id
0,general,What do you do when you see a space man?,"Park your car, man.",227
1,general,What’s Forest Gump’s Facebook password?,1forest1,277
2,programming,Why don't React developers like nature?,They prefer the virtual DOM.,411
3,general,"Why did the fireman wear red, white, and blue ...",To hold his pants up.,323
4,general,Well...,That’s a deep subject.,64


In [20]:
import time

# Print the joke setup (row 0 of the renamed "col2" column)
print(df_loaded_jokes.loc[0, "col2"])
time.sleep(2)  # wait 2 seconds for dramatic effect

# Print the punchline
print(df_loaded_jokes.loc[0, "punchline"])

What do you do when you see a space man?
Park your car, man.


> This function encapsulates fetching, flattening, renaming, and saving in a single reusable unit, so you can plug in a different API without rewriting the logic.
>
> Parameters like `csv_name` and `json_name` give you flexibility to reuse the function without modifying internal lines.

🧠 **Check Your Understanding:**
- What happens if we pass a different API URL?
- How could you extend this function to include error logs or additional cleaning?

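> If we pass a new API URL, the function runs the same steps, provided the endpoint returns JSON that `json_normalize` can flatten; otherwise the `except` branch prints the error and returns `None`. Keep in mind that the first two columns are always renamed to `col1` and `col2`, whatever they contain. From here, we can extend the function with logging or custom filters.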
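
As one possible extension, the sketch below replaces the `print` calls with Python's standard `logging` module and catches a narrower set of exceptions. The function name `collect_with_logging`, the logger name, and the `csv_path` parameter are hypothetical choices for this example, and the column renaming step is left out to keep it short.

```python
import logging
from pathlib import Path

import pandas as pd
import requests

logger = logging.getLogger("data_collector")  # hypothetical logger name


def collect_with_logging(api_url, csv_path, timeout=10):
    """Fetch JSON from api_url, flatten it, save it as CSV, and log what happened.

    Returns the DataFrame, or None if the request or JSON decoding fails.
    """
    try:
        response = requests.get(api_url, timeout=timeout)
        response.raise_for_status()
        df = pd.json_normalize(response.json())
    except (requests.RequestException, ValueError) as exc:
        # Covers network problems, timeouts, HTTP error statuses,
        # and response bodies that are not valid JSON
        logger.error("Could not collect data from %s: %s", api_url, exc)
        return None

    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    df.to_csv(csv_path, index=False)
    logger.info("Saved %d rows to %s", len(df), csv_path)
    return df
```

To see the log messages in a notebook, call `logging.basicConfig(level=logging.INFO)` once before using the function. Unlike `print`, log records carry a severity level and can later be redirected to a file without changing the function itself.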

---

## ✅ Summary


In this notebook, you practiced the full cycle of data collection and preparation:
- Working with **pandas** for structured data
- Using **requests** and **json_normalize** to handle API data
- Cleaning and renaming columns
- Combining data from multiple sources
- Building **functions** to automate collection

🧠 Along the way, we interpreted outputs step-by-step to help you understand not just what to do — but why. Keep exploring different APIs, try customizing functions further, and experiment with new file formats or sources.

You've just built the foundation of your own data collection toolkit. 💡

Feel free to modify the code, test other APIs, and continue exploring. 🚀

---