<a href="https://colab.research.google.com/github/JJingLu/CBS5055-Generative-Artificial-Intelligence-for-Innovative-Communications/blob/main/W1_python_pandas_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# =================================================
# Instructor: Objectives of Today's Workshop
# =================================================
#
# By the end of this workshop, you will be able to:
#
# 1) Understand how to work with a real-world dataset from Hugging Face
#    - Load an open-source dataset using Python
#    - Inspect its structure and understand what the data represents
#
# 2) Practice essential Python and Pandas operations for data analysis
#    - Variables, functions, and basic printing
#    - Working with DataFrames
#    - Selecting, filtering, and summarizing data
#
# 3) Explore text data in a structured way
#    - Examine text length and content
#    - Search for keywords in text fields
#    - Identify common words and patterns
#
# 4) Learn core data cleaning skills used in real projects
#    - Handle missing values
#    - Convert data types (e.g., strings to dates)
#    - Remove duplicate records
#    - Merge multiple tables together
#
# 5) Simulate a real business workflow
#    - Start from raw, unstructured text data
#    - Transform it into a clean, structured table
#    - Export the final result for further analysis or reporting
#
# This notebook is designed for beginners.
# No prior experience with Python, Pandas, or Hugging Face is required.
# We will go step by step and explain each operation as we use it.
#
# =================================================


In [None]:
# ===============================
# Google Colab Python Beginner's Tutorial Example
# Using the Hugging Face dataset: RafaM97/marketing_social_media
# ===============================

# 1️⃣ Install the necessary packages
!pip install datasets pandas openpyxl --quiet


In [None]:
# 2️⃣ Import the packages
import pandas as pd
from datasets import load_dataset

print("Packages installed and imported. ✅")


In [None]:
# 3️⃣ Load the Hugging Face dataset
# 📌 Try asking Copilot:
# "Explain what load_dataset does and what object it returns."
dataset = load_dataset("RafaM97/marketing_social_media")

In [None]:
# 4️⃣ View the structure of the dataset
print("\nDataset Info:")
print(dataset)


In [None]:
# 5️⃣ Convert to a Pandas DataFrame
# 📌 Try asking Copilot:
# "Explain what to_pandas() does and when we should use it."
df = dataset["train"].to_pandas()
print("\nPreview of the first 5 rows of data:")
display(df.head())


In [None]:
# 6️⃣ Basic data exploration
print("\nData shape (rows, columns):", df.shape)
print("Column names:", df.columns.tolist())


In [None]:
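# 7️⃣ Text length statistics (characters per field)
print("\nInstruction / Input / Response length statistics (characters):")
# .str.len() works on one column at a time, so apply it to each text column
display(df[["instruction","input","response"]].apply(lambda col: col.str.len()).describe())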
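
The length statistics step can be tried on a tiny table first, which makes it easier to see what each number means. The sketch below uses a hypothetical two-row DataFrame named `toy` (invented for illustration, with the same column names as the dataset) and counts characters in every text column before summarizing them.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (same column names).
toy = pd.DataFrame({
    "instruction": ["Write an Instagram caption", "Plan a campaign"],
    "input": ["Brand: coffee shop", ""],
    "response": ["Wake up and smell the savings!", "Week 1: teaser posts"],
})

# .str.len() belongs to a single column (a Series), so apply it column by column.
lengths = toy.apply(lambda col: col.str.len())
print(lengths)

# describe() then reports count, mean, min, max, etc. of the character counts.
print(lengths.describe())
```

Note that these are character counts, not word counts. An empty string has length 0, which pulls the minimum and mean down.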


In [None]:
# 8️⃣ Search for instructions that contain a specific keyword
keyword = "Instagram"
# na=False treats missing text as "no match" so the filter never fails on empty cells
filtered = df[df["instruction"].str.contains(keyword, case=False, na=False)]
print(f"\nExamples of instructions containing '{keyword}' (first 5):")
display(filtered[["instruction","response"]].head())
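
Keyword filtering is easiest to understand by looking at the True/False mask it builds. The sketch below uses a hypothetical four-row table (one row is deliberately missing) to show how `case=False` ignores capitalization and how `na=False` counts a missing value as "no match".

```python
import pandas as pd

# Hypothetical toy data; the second instruction is missing on purpose.
toy = pd.DataFrame({
    "instruction": [
        "Write an Instagram caption",
        None,
        "Draft a LinkedIn post",
        "instagram story ideas",
    ],
})

# Build a True/False mask: one value per row.
mask = toy["instruction"].str.contains("Instagram", case=False, na=False)
print(mask)

# Keep only the rows where the mask is True.
matches = toy[mask]
print(matches)
```

Without `na=False`, the missing row would produce a missing value in the mask, and using that mask to filter the table can raise an error.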


In [None]:
# 9️⃣ Common word statistics
print("\nTop 10 most common words in the instruction column:")
top10 = df["instruction"].str.split().explode().value_counts().head(10)
print(top10)

# 🔹 Display a complete case using a custom function
def show_summary(idx):
    print(f"\n📍 Data index = {idx}")
    print("Instruction:\n", df.loc[idx,"instruction"])
    print("Input:\n", df.loc[idx,"input"])
    print("Response:\n", df.loc[idx,"response"])
    print("-"*80)

print("\nDisplay the case at index 10 (the 11th row, because indexing starts at 0):")
show_summary(10)
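
Splitting on whitespace counts "Instagram" and "instagram." as different words. One possible refinement, shown here on a hypothetical three-sentence Series, is to lowercase the text and strip punctuation before counting. This is a sketch of one common approach, not the only way to normalize text.

```python
import pandas as pd

# Hypothetical toy sentences for illustration.
toy = pd.Series([
    "Create a post for Instagram.",
    "create an Instagram story",
    "Plan a post",
])

words = (
    toy.str.lower()                            # "Create" and "create" become the same word
       .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation such as the trailing "."
       .str.split()                            # one list of words per row
       .explode()                              # one word per row
)

counts = words.value_counts()
print(counts.head(5))
```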


In [None]:
# 🔟 Save the data as CSV
df.to_csv("marketing_social_media.csv", index=False)
print("\n✅ Saved as marketing_social_media.csv")
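
A quick way to check an export is to read it back and compare. The sketch below writes a hypothetical one-row table to an in-memory text buffer instead of a real file, so it runs anywhere. With `index=False`, the row numbers are not written, and the reloaded table has exactly the original columns.

```python
import io
import pandas as pd

# Hypothetical one-row table for illustration.
toy = pd.DataFrame({
    "instruction": ["Plan a campaign"],
    "response": ["Week 1: teaser posts"],
})

# Write CSV text into memory rather than onto disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```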


In [None]:
# 1️⃣1️⃣ Advanced practice example: counting cases containing 'budget'
# na=False counts rows with a missing input as not containing the word
df["has_budget"] = df["input"].str.contains("budget", case=False, na=False)
print("\nProportion of inputs containing the word 'budget':")
print(df["has_budget"].value_counts(normalize=True))
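
The workshop objectives also list four cleaning skills: handling missing values, converting strings to dates, removing duplicates, and merging tables. The sketch below walks through all four on two small tables. Everything in it is invented for illustration: the tables `posts` and `budgets` and the columns `post_id`, `platform`, `posted_on`, and `budget_usd` are hypothetical and are not part of the marketing dataset.

```python
import pandas as pd

# Hypothetical raw table: one duplicated row, missing platforms, one bad date.
posts = pd.DataFrame({
    "post_id": [1, 2, 2, 3],
    "platform": ["Instagram", None, None, "TikTok"],
    "posted_on": ["2024-01-05", "2024-01-06", "2024-01-06", "not a date"],
})

# Hypothetical second table to merge in; post 3 has no budget on record.
budgets = pd.DataFrame({"post_id": [1, 2], "budget_usd": [100, 250]})

# 1) Remove exact duplicate rows.
clean = posts.drop_duplicates().copy()

# 2) Fill missing platform names with a placeholder.
clean["platform"] = clean["platform"].fillna("unknown")

# 3) Convert strings to dates; values that cannot be parsed become NaT (missing).
clean["posted_on"] = pd.to_datetime(clean["posted_on"], errors="coerce")

# 4) Merge the two tables; how="left" keeps every post even without a budget.
merged = clean.merge(budgets, on="post_id", how="left")
print(merged)
```

A left merge is used here so that no posts are lost. Posts without a matching budget simply get a missing value in `budget_usd`.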
