**Visualization and odds and ends**

For our last lab, let's look explicitly at something we've been using all the way through: data visualization.

And perhaps more importantly, let's tie up some loose ends. I'd like to offer you some code and final thoughts on the "next level" or future of the methods we've used in this class (for example, prompt-based methods), plus a few other tidbits that might be useful for you going forward.

**Visualization**

There are a few good libraries for making visualizations. The main one we've been using in this class is called "matplotlib".

We can't cover how to generate every type of viz with Matplotlib, but we can cover the basics of a few, and then give you access to the documentation where you can locate how to do them on your own. So, here we go....

So, what is **Matplotlib**?
Matplotlib is a powerful library in Python used to create static, interactive, and animated visualizations. It can help display data in a clear, easy-to-understand way through charts, graphs, and plots.

Key Features:
- Works well with NumPy arrays and Pandas DataFrames.
- Offers a wide range of plots: line, bar, scatter, pie, histograms, etc.
- Provides fine control over plot appearance (labels, colors, ticks, etc.).

Let's start by installing and importing matplotlib.

In [None]:
!pip install matplotlib
import matplotlib.pyplot as plt


Accessing Documentation
You can easily access the Matplotlib documentation through the following link: https://matplotlib.org/stable/index.html

This is helpful if you want to explore the available types of visualizations and get the sample code/instructions, such as scatter plots, histograms, etc.

In Colab, you can also run the following to see the functions available in matplotlib.pyplot:


In [None]:
help(plt)


Example 1: Line Plot
Let’s start with a simple line plot to show how a trend might look over time.

Code for Line Plot:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Example data
x = np.linspace(0, 10, 100)  # 100 points from 0 to 10
y = np.sin(x)  # sine wave data

# Create the plot
plt.plot(x, y)

# Add title and labels
plt.title('The Wave')
plt.xlabel('x')
plt.ylabel('sin(x)')

# Show the plot
plt.show()


OK, what if instead of a sine wave we just wanted a linear chart, with every y equal to its x multiplied by 7?

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Example data for x (100 points from 0 to 10)
x = np.linspace(0, 10, 100)

# Create a linear relationship for y (y = 7 * x)
y = x * 7

# Create the plot
plt.plot(x, y)

# Add title and labels
plt.title('Linear Chart: y = 7 * x')
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()


What's the difference? How was the code altered?

OK, what if we already have the data, which is a more common scenario? For example, we have a list of points as x and a list of points as y, and we want to plot them against each other.

In [None]:
import matplotlib.pyplot as plt

# Two lists of data (x and y)
x = [1, 2, 3, 4, 5]  # Example x values
y = [2, 4, 6, 8, 10]  # Example y values

# Create the plot
plt.plot(x, y)

# Add title and labels
plt.title('Plotting Two Lists of Points')
plt.xlabel('x')
plt.ylabel('y')

# Show the plot
plt.show()
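
In practice your points will often live in a Pandas DataFrame rather than in two loose lists. Here is a minimal sketch of that case; the column names ("week", "posts") and the numbers are made up for illustration, so swap in your own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up example data: number of posts collected in each week
df = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "posts": [12, 18, 9, 22, 30],
})

# Pass DataFrame columns to plt.plot just like lists
plt.plot(df["week"], df["posts"], marker="o")

# Add title and labels
plt.title("Posts Collected per Week (made-up data)")
plt.xlabel("Week")
plt.ylabel("Number of posts")

# Show the plot
plt.show()
```

The only change from the list version is where x and y come from: each one is a column selected from the DataFrame.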


Now let's add some styling

In [None]:
import matplotlib.pyplot as plt

# Create the figure and set the figure size first
plt.figure(figsize=(8, 6))  # Width: 8 inches, Height: 6 inches

# Example data (x and y)
x = [0, 1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25, 36]  # y = x^2

# Create the plot with some advanced styling
plt.plot(x, y,
         color='purple',       # Line color
         linestyle='--',       # Dashed line style
         linewidth=3,          # Line width (thickness)
         marker='o',           # Markers at each data point
         markerfacecolor='red', # Marker fill color
         markeredgewidth=2)    # Marker edge thickness

# Add title and labels
plt.title('Advanced Line Plot Styling')
plt.xlabel('x')
plt.ylabel('y')

# Set axis limits
plt.xlim(-1, 6)  # Set x-axis limits from -1 to 6
plt.ylim(0, 40)  # Set y-axis limits from 0 to 40

# Show the plot
plt.show()


Now you try: make a line plot of your own, with your own data points and your own choices about color, line style, etc.

**Bar chart**

Now let's try a bar chart

We'll first create a set of labels (for the categories) and numbers (for the corresponding values).

Code for Dataset and Bar Chart:

In [None]:
import matplotlib.pyplot as plt

# Dataset: Categories and values
categories = ['A', 'B', 'C', 'D', 'E']
values = [3, 7, 2, 5, 6]

# Create the bar chart
plt.bar(categories, values)

# Add title and labels
plt.title('Bar Chart Example')
plt.xlabel('Categories')
plt.ylabel('Values')

# Show the plot
plt.show()


Now we can add some styling changes

We'll change the color of the bars, add a grid for better readability, and adjust the figure size.

Code for Styling Changes:

In [None]:
import matplotlib.pyplot as plt

# Dataset: Categories and values
categories = ['A', 'B', 'C', 'D', 'E']
values = [3, 7, 2, 5, 6]

# Create the bar chart with styling
plt.figure(figsize=(8, 6))  # Set figure size
plt.bar(categories, values,
        color='skyblue',         # Set bar color
        edgecolor='black',       # Set edge color of bars
        linewidth=1.5)           # Set border thickness of bars

# Add title and labels
plt.title('Styled Bar Chart Example', fontsize=14)
plt.xlabel('Categories', fontsize=12)
plt.ylabel('Values', fontsize=12)

# Add grid for better readability
plt.grid(True, axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()


Styling Explanation:
color='skyblue': This changes the color of the bars to a light blue. You can use other colors or hex codes ('#FF5733' for example) for custom colors.

edgecolor='black': Adds a black border to each bar for better definition.

linewidth=1.5: Makes the borders of the bars thicker.

plt.figure(figsize=(8, 6)): Adjusts the overall size of the plot.

plt.title() and plt.xlabel()/plt.ylabel(): Adds a title and labels with increased font sizes for clarity.

plt.grid(True, axis='y', linestyle='--', alpha=0.7): Adds a dashed grid along the y-axis. The alpha=0.7 makes the grid lines a little transparent, so they don't overpower the plot.
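
As a follow-up to the hex code note above, here is a small sketch that gives each bar its own hex color and also saves the chart to an image file. The specific hex codes and the file name 'styled_bar_chart.png' are arbitrary choices for illustration.

```python
import os
import matplotlib.pyplot as plt

# Same dataset as above
categories = ['A', 'B', 'C', 'D', 'E']
values = [3, 7, 2, 5, 6]

# One hex color per bar (arbitrary example colors)
bar_colors = ['#FF5733', '#33A1FF', '#33FF57', '#FFC300', '#8E44AD']

plt.figure(figsize=(8, 6))
plt.bar(categories, values, color=bar_colors, edgecolor='black')
plt.title('Bar Chart with Hex Colors')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.grid(True, axis='y', linestyle='--', alpha=0.7)

# Save the figure to a file before showing it
plt.savefig('styled_bar_chart.png', dpi=150, bbox_inches='tight')
plt.show()

print('Saved:', os.path.exists('styled_bar_chart.png'))
```

Calling plt.savefig() before plt.show() is the safer order, since in a notebook the figure is usually cleared once it has been shown.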

**Geospatial data**

Someone asked about plotting geospatial data, so let's look at a sample.

We'll use plotly.

In [None]:
!pip install plotly


In [None]:
import plotly.express as px

# Fake population data for a few Asian countries
asia_population = {
    'China': 1393409038,
    'India': 1366417754,
    'Indonesia': 270625568,
    'Pakistan': 216565318,
    'Bangladesh': 163046161,
    'Japan': 126850000,
    'Philippines': 106651922,
    'Vietnam': 96491346,
    'Turkey': 83154997,
    'Iran': 83992953
}

# Create a DataFrame from the population data
import pandas as pd
data = pd.DataFrame(list(asia_population.items()), columns=['Country', 'Population'])

# Plot a choropleth map
fig = px.choropleth(data_frame=data,
                    locations='Country',  # Column with country names
                    locationmode='country names',  # Using country names for locations
                    color='Population',  # Column to color the countries by
                    color_continuous_scale='YlGnBu',  # Color scale (yellow-green-blue)
                    labels={'Population': 'Population'},
                    title='Asian Countries Colored by Population')

# Show the map
fig.show()


In [None]:
import plotly.express as px
import pandas as pd

# Fake population data for a few U.S. states (using state abbreviations)
us_population = {
    'CA': 39512223,  # California
    'TX': 28995881,  # Texas
    'FL': 21477737,  # Florida
    'NY': 19453561,  # New York
    'PA': 12801989,  # Pennsylvania
    'IL': 12671821,  # Illinois
    'OH': 11689100,  # Ohio
    'GA': 10617423,  # Georgia
    'NC': 10488084,  # North Carolina
    'MI': 9986857    # Michigan
}

# Create a DataFrame from the population data
data = pd.DataFrame(list(us_population.items()), columns=['State Abbreviation', 'Population'])

# Plot a choropleth map of the U.S. using state abbreviations
fig = px.choropleth(data_frame=data,
                    locations='State Abbreviation',  # Column with state abbreviations
                    locationmode='USA-states',  # Using U.S. state codes for locations
                    color='Population',  # Column to color the states by
                    color_continuous_scale='YlGnBu',  # Color scale (yellow-green-blue)
                    labels={'Population': 'Population'},
                    title='U.S. States Colored by Population')

# Zoom in on the U.S. by setting the scope to 'usa'
fig.update_geos(scope='usa')

# Show the map
fig.show()


How to Adapt the Choropleth Map to Your Own Data:
Change Locations:

Replace 'State Abbreviation' in the locations argument with the name of the column that holds your geographic identifiers (e.g., a column of country names). Set locationmode to match: the built-in modes are 'USA-states', 'country names', and 'ISO-3' (three-letter country codes). Cities and other custom regions are not covered by these modes and need boundary data supplied separately (see the plotly documentation).

Example: If you’re using countries, change locationmode='USA-states' to locationmode='country names'.

Replace Values:

Replace the Population column in the color argument with your own data (e.g., GDP, Average Income).

Ensure this column contains numeric values that represent the data you want to visualize.

Change Data:

Update the data dictionary or DataFrame with your locations and corresponding values.

Example: For countries, put their names in the locations column and the metric (e.g., GDP) in the column you pass to color.

Customize the Color Scale:

Replace the color_continuous_scale with your preferred color scale (e.g., 'Viridis', 'RdYlBu').

Zoom on a Specific Region:

Change the scope value in fig.update_geos(scope=...) to 'world', 'usa', or another supported region depending on what you want to zoom in on (e.g., 'europe' for European countries).

**Other odds and ends**

1. **Scrapers**

Here's a better video scraper that's also easier to use in Colab: yt-dlp. It works for YouTube, TikTok, and Instagram Reels.

In [None]:
!pip install yt-dlp


In [None]:
import os
import subprocess

def download_video(url, output_dir="."):
    os.makedirs(output_dir, exist_ok=True)
    result = subprocess.run([
        "yt-dlp",
        "--no-playlist",  # Avoid downloading playlists
        "-f", "best",     # Automatically select the best format
        "-o", f"{output_dir}/%(uploader)s_%(id)s.%(ext)s",  # Save with uploader name and video ID
        url
    ], capture_output=True, text=True)

    if result.returncode == 0:
        print("✅ Success\n")
    else:
        print("❌ Failed\n", result.stderr)

# Set output directory
output_dir = "video_downloads"

# TikTok URL
download_video("https://www.tiktok.com/@zachking/video/6768504823336815877", output_dir=output_dir)

# YouTube URL
download_video("https://www.youtube.com/watch?v=PPi7zW1gS6k", output_dir=output_dir)

# Instagram Reel URL
download_video("https://www.instagram.com/reels/DB4uTijy1Hn/", output_dir=output_dir)


In [None]:
import os

# Specify the directory where the videos were saved
output_dir = "video_downloads"

# List all files in the output directory
files = os.listdir(output_dir)

# Print the files in the directory
print("Files downloaded:", files)


Documentation is here:
https://github.com/yt-dlp/yt-dlp

**Prompt based methods**

As we discussed, prompt-based methods may be the future. After all, why build a classifier when you can prompt an LLM to classify text for you by asking it: is this text an X or a Y? Why create video embeddings when you can just ask a prompt-based model to describe a video?

As we've discussed, though, using prompt-based methods like this can be difficult because a) you often need to go through an API and get an API key for access, and b) it often requires hooking up a credit card and costs money.

Fortunately, though, Google Gemini does let you test out their prompt-based APIs on a free tier, and you don't need to hook up a credit card.

BUT you do need to get an API key. We're going to test this out together, and when we do, it's a good time to discuss API KEY SAFETY. We're also going to delete our keys after we use them today to avoid any risk.

First, here's the full documentation for the Gemini API

https://ai.google.dev/gemini-api/docs

Here's where to get an API key:

https://aistudio.google.com/apikey

We're gonna get API keys together, and then work through making the below sample code work with them. After, we're going to delete them.

In [None]:
!pip install -q -U google-genai

In [None]:
from google import genai

**SHARE THIS WITH NO ONE AND REMOVE WHEN YOU'RE DONE**

Here we're doing what's called hardcoding your API key into your code. It's fine for testing and brief use, but for the long term, safer methods should be used. Gemini gives instructions here on how to use and deploy your key more safely:
https://ai.google.dev/gemini-api/docs/api-key

For now, since we're hardcoding it, you should remove it from the code before you leave this window, and then we'll delete our keys.
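
For later use outside today's session, one common alternative to hardcoding is to keep the key out of the notebook entirely. The sketch below is only one possible pattern: it reads the key from an environment variable and, if that isn't set, asks you to paste it in without echoing it to the screen. The function name load_api_key is made up, and the variable name GEMINI_API_KEY is an assumption here; check the Gemini page linked above for the recommended setup.

```python
import os
from getpass import getpass

def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the key from an environment variable, or prompt for it without echoing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides what you type, so the key never shows up in the cell output
        key = getpass(f"Paste your {env_var}: ")
    return key.strip()

# Usage (uncomment to run): the key is never written into the notebook itself
# API_KEY = load_api_key()
```

Either way, the key itself never appears in a cell, so it can't leak if you share or save the notebook.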

In [None]:

# Set your API key:
API_KEY = "put here as string"

Let's also set the model we want to work with

In [None]:
#set model
MODEL_NAME = "gemini-2.0-flash"  # or "gemini-1.5-pro", etc.

And now let's set a prompt text

In [None]:
#set prompt
PROMPT_TEXT = "Explain how AI works in a few words."

Now let's plug these all into the code sequence to run a prompt through a particular model with our API key.

In [None]:
#run

client = genai.Client(api_key=API_KEY)

response = client.models.generate_content(
    model=MODEL_NAME,
    contents=PROMPT_TEXT
)

print(response.text)

OK, but what if we want to make things a bit more complicated, like using this as a classifier for text? Say we have a few sentences, and for each one we want to create a prompt that asks the model to classify it into a category, YES or NO. Given how good LLMs are at parsing text, couldn't this work better than building our own classification model? Let's see:

In [None]:
# Your fake list of sentences
sentences = [
    "I made a delicious pasta for dinner last night.",
    "The cat jumped over the couch.",
    "Here's how you boil an egg perfectly.",
    "We discussed political philosophy at lunch."
]

# Function to classify a sentence
def classify_sentence(sentence):
    prompt = (
        f"Classify the following sentence. "
        f"If it is about cooking or preparing food, reply with only YES. "
        f"If it is not about cooking or preparing food, reply with only NO.\n\n"
        f"Sentence: {sentence}"
    )
    try:
        response = client.models.generate_content(
            model=MODEL_NAME,
            contents=prompt
        )
        return response.text.strip()
    except Exception as e:
        print(f"Error classifying: {sentence[:30]}... {e}")
        return None

# Apply to each sentence
results = []
for sentence in sentences:
    result = classify_sentence(sentence)
    results.append((sentence, result))

# See results
for sentence, classification in results:
    print(f"Sentence: {sentence}\nClassified as: {classification}\n")
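
To actually answer the "does this work better?" question, you'd want to compare the model's replies with labels you assign by hand. Models don't always reply with exactly YES or NO, so it helps to tidy the replies first. The sketch below does this with no API calls: the replies and hand labels are invented examples, and normalize_label is a hypothetical helper name.

```python
def normalize_label(reply):
    """Map a raw model reply to 'YES', 'NO', or None if it is neither."""
    if reply is None:
        return None
    # Drop surrounding whitespace, quotes, and end punctuation, then uppercase
    cleaned = reply.strip().strip(".!\"'").upper()
    return cleaned if cleaned in {"YES", "NO"} else None

# Invented examples of what a model might send back, plus hand-assigned labels
raw_replies = ["YES", "no.", "Yes\n", "I think this is about cooking", None]
hand_labels = ["YES", "NO", "YES", "NO", "NO"]

predicted = [normalize_label(r) for r in raw_replies]

# Only score replies that could be mapped to a clean label
usable = [(p, h) for p, h in zip(predicted, hand_labels) if p is not None]
agreement = sum(p == h for p, h in usable) / len(usable)

print(predicted)
print(f"Usable replies: {len(usable)} of {len(raw_replies)}; agreement: {agreement:.2f}")
```

Replies that can't be mapped to a clean label are counted separately rather than guessed at, so you can see how often the model ignores the "reply with only YES/NO" instruction.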


And finally, what about processing an image or video and getting a description? This seems like a much better way to get rich data about images and videos, and to analyze them, than the image models we've been using.

You can check out the Gemini documentation for how to load an image in, but let's try an example with video. First we want to change which Gemini model we're accessing, and we also want to pick a video file in advance. Let's use one we already have in the system that we just scraped.

In [None]:
!ls

In [None]:
!ls video_downloads

So it looks like we have a video at this path:

video_downloads/zachking_6768504823336815877.mp4

In [None]:
# Change the model to the one we'll use for video
MODEL_NAME = "gemini-2.5-pro-exp-03-25"  # or "gemini-1.5-pro", etc.

In [None]:
#name our video as a variable

VIDEO_PATH = "video_downloads/zachking_6768504823336815877.mp4"
PROMPT_TEXT = "Describe this video."  # <-- You can easily change this now!

In [None]:
#from google import genai


# ====================
# Gemini Setup
# ====================

client = genai.Client(api_key=API_KEY)

# Sending the bytes inline only works for videos smaller than 20 MB
with open(VIDEO_PATH, 'rb') as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model=MODEL_NAME,
    contents=genai.types.Content(
        parts=[
            genai.types.Part(
                inline_data=genai.types.Blob(data=video_bytes, mime_type='video/mp4')
            ),
            genai.types.Part(
                text=PROMPT_TEXT
            )
        ]
    )
)

print(response.text)
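
Since the inline approach above only works for videos under 20 MB, it can be worth checking a file before sending it. This is a small sketch using only the standard library; check_inline_video is a made-up helper name, and the limit is simply the 20 MB figure from the comment in the code above.

```python
import mimetypes
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # the 20 MB inline limit mentioned above

def check_inline_video(path, limit=INLINE_LIMIT_BYTES):
    """Report a file's size, guessed MIME type, and whether it fits under the limit."""
    size = os.path.getsize(path)
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the file extension
    return {
        "size_mb": round(size / (1024 * 1024), 2),
        "mime_type": mime_type,
        "fits_inline": size < limit,
    }

# Check every file we scraped earlier (skipped if the folder doesn't exist)
if os.path.isdir("video_downloads"):
    for name in sorted(os.listdir("video_downloads")):
        print(name, check_inline_video(os.path.join("video_downloads", name)))
```

If a file doesn't fit, the Gemini documentation describes other ways to upload larger files.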


Try experimenting with the prompt...