# NumPy and Pandas practical exercises

This notebook contains code examples and practical exercises to help you get acquainted with NumPy and Pandas. Run the code cells to see the output. After looking through the code cells, try completing the exercises below. The solutions are provided in the file 'numpy_pandas_walkthrough-solutions.ipynb'.

In [None]:
# Kernel / venv verification
# Run this first. No need to edit
import sys, importlib.util  # find_spec lives in the importlib.util submodule, so import it explicitly
print('python executable:', sys.executable)
print('ipykernel available:', importlib.util.find_spec('ipykernel') is not None)
import pandas as pd
print('pandas version:', pd.__version__)


# Refresher

In [None]:
#open the file "pittsburgh-weather-2024.csv"
with open('pittsburgh-weather-2024.csv', 'r') as file:
    lines = file.readlines()

#look at the first 5 lines of the file
print(lines[:5])
print()

#extracting daily high temperatures from the CSV file
daily_highs = []
for line in lines[1:]: #first line is the header
    row = line.strip().split(',')
    daily_highs.append(float(row[1]))
print("Daily high temperatures:", daily_highs)
print()

#Using list of daily highs to calculate the average high temperature
total_high = 0
for t in daily_highs:
    total_high += t

average_high = total_high / (len(daily_highs))

print("Average high temperature:", average_high)

# 1D and 2D arrays

In [None]:
import numpy as np


### 1D 

In [None]:

arr1 = np.array([1, 2, 3, 4, 5])

In [None]:
print(arr1)

**array.dtype** : This tells you the data type of the elements inside the array.

**type(array)** : This tells you the type of the object itself — i.e., what kind of Python object the array is.



In [None]:
print(arr1.dtype)
print(type(arr1))

### 2D

In [None]:
arr2 = np.array([[1, 2, 3],
                [4, 5, 6]])


In [None]:
print(arr2)

In [None]:
print(type(arr2))
print(arr2.dtype)

### Upcasting
In NumPy, **Upcasting** refers to converting mixed data types to a single, more general type (like converting integers to strings) to ensure the array is homogeneous.

specific -> general

In [None]:
arr3 = np.array([[1, 2, 3, 4, 5],
                 ['one', 'two', 'three', 'four', 'five']])

In [None]:
print(arr3)

In [None]:
print(type(arr3))
print(arr3.dtype)

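The dtype printed above is a Unicode string type such as `<U21` (or `<U11` on platforms where NumPy's default integer is 32-bit). The number is the maximum string length each element can hold; it is sized to fit the text form of any value of the integer type, not just the strings in this particular array.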

NumPy sees mixed types (integers and strings), and it automatically upcasts the integers to strings so all elements share a common type.
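The same "specific -> general" rule applies between numeric types. A minimal sketch, using illustrative variable names that are not part of the notebook above:

```python
import numpy as np

# Integers mixed with a float are upcast to float64
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)      # float64

# Numbers mixed with a string are upcast to a Unicode string type
mixed_text = np.array([1, 2.5, 'three'])
print(mixed_text.dtype.kind)    # U
```

Once the numbers have become strings, arithmetic such as `mixed_text + 1` no longer works, so it is usually worth checking `dtype` after building an array from mixed data.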

# Creating Arrays

In [None]:
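np.array([1, 2, 3, 4, 5])  # from a Python list

In [None]:
np.arange(0, 10, 2)  # start, stop (exclusive), step -> 0, 2, 4, 6, 8

In [None]:
np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1, both ends included

In [None]:
np.zeros((3, 2))  # 3x2 array filled with zeros

In [None]:
np.ones(4)  # 1D array of four ones

In [None]:
a = np.random.rand(3, 4)  # 3x4 array of uniform random values in [0, 1)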

In [None]:
print(a)

In [None]:
print(arr2[0]) # accessing the first row
print(type(arr2[0]))
print(type(arr2[0][0]))

In [None]:
print(arr2[1]) # accessing the second row
print(type(arr2[1]))
print(type(arr2[1][0]))

In [None]:
print(arr2[0, 0]) # accessing the element in row 0, column 0
print(type(arr2[0, 0]))

# Array Operations

### Element-wise math

**Element-wise math** allows you to perform mathematical operations on every element in a NumPy array without needing to write a `for` loop.

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr

In [None]:
arr + 10

In [None]:
arr - 5

In [None]:
arr * 3

In [None]:
arr / 5

In [None]:
arr ** 2

### Broadcasting
**Broadcasting** is a powerful mechanism that allows NumPy to perform arithmetic operations on arrays of different shapes. 

Normally, two arrays must have exactly the same shape to be added element by element. Broadcasting relaxes this rule by virtually "stretching" the smaller array across the larger one so that their shapes become compatible, without actually copying its data.

**Why it matters:** It lets you write clean, vectorized code without explicit loops (e.g., `for i in rows...`) and avoids creating unnecessary large temporary arrays.

In [None]:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
arr

In [None]:
# Scalar broadcasting (Fast)
print(10)
arr + 10

In [None]:
# 0-d array broadcasting (Fast)
print(np.array(10))
arr + np.array(10)

In [None]:
# Manual full array creation (Slow/Memory intensive)
print(np.array([[10, 10, 10],[10, 10, 10],[10, 10, 10]]))
arr + np.array([[10, 10, 10],[10, 10, 10],[10, 10, 10]])
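Broadcasting is not limited to scalars. As a rough sketch (the names `grid`, `row_offsets`, and `col_offsets` are made up for this example), a 1D array can be stretched across the rows of a 2D array, and a single column can be stretched across its columns:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# A 1D array of shape (3,) is stretched across every row of the (3, 3) grid
row_offsets = np.array([10, 20, 30])
added_per_column = grid + row_offsets
print(added_per_column)

# A column of shape (3, 1) is stretched across every column instead
col_offsets = np.array([[100], [200], [300]])
added_per_row = grid + col_offsets
print(added_per_row)
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.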

### Aggregation
**Aggregation** functions take an array (or a specific axis of an array) and reduce it to a single value, such as a sum, mean, or maximum.

**Why it matters:** NumPy aggregations are highly optimized and much faster than using Python's built-in `sum()` or `min()` functions on large datasets. They are essential for statistical analysis and data exploration.

In [None]:
# Create a sample 1D array
arr = np.array([1, 2, 3, 4, 5])
arr

In [None]:
# Calculate the mean (average)
np.mean(arr)

In [None]:
# Calculate the sum of all elements
np.sum(arr)

In [None]:
# Find the minimum value
np.min(arr)

In [None]:
# Find the maximum value
np.max(arr)

In [None]:
# Calculate the standard deviation
np.std(arr)

Aggregations also work on 2D arrays:

In [None]:
# Create a sample 2D array
arr2 = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
arr2

In [None]:
# Mean of the entire 2D array
np.mean(arr2)

In [None]:
# Sum of all elements in the 2D array
np.sum(arr2)

In [None]:
# Minimum value in the 2D array
np.min(arr2)

In [None]:
# Maximum value in the 2D array
np.max(arr2)

In [None]:
# Standard deviation of the 2D array
np.std(arr2)
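The calls above reduce the whole array to one number. To aggregate along a specific axis instead, pass `axis`. A small sketch, using a stand-alone array named `table` so it can be run on its own:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

col_sums = np.sum(table, axis=0)    # collapse the rows -> one sum per column
row_sums = np.sum(table, axis=1)    # collapse the columns -> one sum per row
row_means = np.mean(table, axis=1)  # one mean per row

print(col_sums)   # [12 15 18]
print(row_sums)   # [ 6 15 24]
print(row_means)  # [2. 5. 8.]
```

A handy way to remember it: `axis` names the dimension that disappears from the result.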

# Array Indexing and Slicing: NumPy vs. Python Lists

While NumPy slicing syntax mimics standard Python list slicing, the underlying behavior is fundamentally different.

### 1. Views vs. Copies
*   **Python Lists:** Slicing a list creates a **shallow copy**. Modifying the slice does *not* affect the original list.
*   **NumPy Arrays:** Basic slicing (e.g. `arr[1:4]`) creates a **view**. No data is copied. Modifying the slice **changes the original array**. (Boolean and integer-array indexing return copies instead.)

**Why it matters:**
*   **Memory Efficiency:** Work with huge datasets without duplicating data.
*   **Performance:** Creating views is instantaneous.

### 2. Multidimensional Indexing
*   **Python:** `matrix[row][col]`
*   **NumPy:** `matrix[row, col]` (More concise and efficient)

In [None]:
arr

In [None]:
arr[0]

In [None]:
arr[1:3]

In [None]:
arr2

In [None]:
arr2[0]

In [None]:
arr2[0,2]

In [None]:
arr2[1,1:]

In [None]:
# Demonstration of View vs Copy
print("--- Python List (Copy) ---")
py_list = [0, 1, 2, 3, 4]
py_slice = py_list[1:4]
py_slice[0] = 99
print(f"Original List: {py_list}")   # Unchanged

print("\n--- NumPy Array (View) ---")
np_arr = np.array([0, 1, 2, 3, 4])
np_slice = np_arr[1:4]
np_slice[0] = 99
print(f"Original Array: {np_arr}")   # Changed
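When an independent slice is needed, one option is to call `.copy()` on it. The sketch below (variable names are illustrative) also uses `np.shares_memory` to check whether two arrays are backed by the same data:

```python
import numpy as np

original = np.array([0, 1, 2, 3, 4])

view = original[1:4]                 # basic slice: shares memory with `original`
independent = original[1:4].copy()   # explicit copy: has its own data

independent[0] = 99                  # does not touch `original`
print(original)                      # [0 1 2 3 4]

print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False
```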

### Boolean Indexing
**Boolean Indexing** allows you to select elements from an array based on a condition (True/False) rather than their position (index).

**Why it matters:** This is the primary way to filter data in NumPy and Pandas. Instead of writing a loop to check every element (e.g., `if x > 5`), you can apply a condition to the whole array at once.

In [None]:
# Select elements greater than or equal to 4
arr[arr >= 4]

In [None]:
# Select even numbers from the 2D array
arr2[arr2 % 2 == 0]
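It can help to look at the condition on its own: it is simply an array of `True`/`False` values with the same shape as the data. A minimal sketch with a stand-alone array named `values`, which also shows how two conditions might be combined:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

mask = values >= 4          # a boolean array, one True/False per element
print(mask)
print(values[mask])         # [4 5]

# Combine conditions with & (and) or | (or); each condition needs parentheses
middle = values[(values > 1) & (values < 5)]
print(middle)               # [2 3 4]
```

Python's `and` / `or` keywords do not work element-wise on arrays, which is why `&` and `|` are used here.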

### Shape and Reshape

In [None]:
print(arr)
print(arr.shape)
print(arr.size)
print(arr.ndim)

In [None]:
print(arr2)
print(arr2.shape)
print(arr2.size)
print(arr2.ndim)


### Reshape

In [None]:
arr2.reshape(3, 3) 

In [None]:
arr2.reshape(9,1)

In [None]:
arr2.flatten()

In [None]:
array = np.array([[1, 2, 3],[4,5,6]])

print(array)

In [None]:
array.reshape(3,2)
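Reshaping only rearranges the existing elements, so the new shape has to contain exactly as many slots as the array has values. A short sketch (the array `six` is just an example) showing `-1`, which asks NumPy to work out one dimension, and what happens when the sizes do not match:

```python
import numpy as np

six = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3), 6 elements

# -1 asks NumPy to infer that dimension from the total size
inferred = six.reshape(-1, 2)
print(inferred.shape)   # (3, 2)

# The new shape must account for every element: 6 values cannot fill 4x2
try:
    six.reshape(4, 2)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```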

# NumPy 
Complete the `#your code here` sections below. 


In [None]:
print(daily_highs)

In [None]:
#Use np.array to find average daily high
daily_highs_array = #your code here
average_high = #your code here

print("Average high temperature:", average_high)

# Pandas

In [None]:
# import pandas as pd

## Creating Data

In [None]:
daily_highs

In [None]:
pgh_weather = pd.Series(
    data=daily_highs,
    name='Pittsburgh Daily Highs'
)

In [None]:
pgh_weather

In [None]:
pd.set_option('display.max_rows', None)  # Show all rows in the Series

In [None]:
pgh_weather

In [None]:
pd.set_option('display.max_rows', 20) 

In [None]:
df = pd.DataFrame({
    'Day': [f'Day {i+1}' for i in range(len(daily_highs))],
    'High': daily_highs
})

In [None]:
df

In [None]:
df = pd.read_csv('pittsburgh-weather-2024.csv')

In [None]:
df

## Data Exploration

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df = pd.read_csv('pittsburgh-weather-2024.csv')

In [None]:
df.columns

# Selecting and Filtering Data

In [None]:
df['Day']

In [None]:
df[['Day', 'Snow']]

In [None]:
df[df['Day'] == 'December 13']

In [None]:
df[df['High'] >= 89]

In [None]:
df[df['Low'] == 13.0]

In [None]:
df.iloc[3]

In [None]:
df.loc[3]

In [None]:
df.loc['December 13'] # Raises KeyError: 'Day' is not the index yet, so labels are still 0, 1, 2, ...

In [None]:
df.set_index('Day', inplace=True)
df.head()


In [None]:
df.loc['January 4']

In [None]:
df.loc[3] #Doesn't work after setting index

In [None]:
df.iloc[3] # Still works with iloc

In [None]:
df.head()

## Pandas Exercise 

Load `all_olympic_medalists.csv`.

Complete the following tasks and answer the following questions:
1. Show the first 5 rows and the last 5 rows.
2. How many rows are in the dataset?
3. Are there any missing values in any of the columns? 
4. Which values are missing in the medal column? How many total medals are recorded in the dataset?
5. How many medals were awarded in the first Olympic year (1896)? How would you do this if you didn't know the first year?
6. How many medals has the US won?
7. What years were medals awarded for Rugby?
8. What was the first year women were included?

In [None]:
df = pd.read_csv('all_olympic_medalists.csv')

#### 1. show the first 5 rows and the last 5 rows

In [None]:
# show the first 5 rows
df.<FILL_THIS_IN>()


In [None]:
# show the last 5 rows

#### 2. How many rows are in the dataset?

In [None]:
# use len

# use .shape

# use .info()

#### 3. Are there any missing values in any of the columns? 


In [None]:
# Find out using the appropriate pandas functions


In [None]:
#can you think of another way to find null values?

#### 4. Which values are missing in the medal column? How many total medals are recorded in the dataset?


In [None]:
# Show the missing values in the 'medal' column
df[df['medal'].<FILL_THIS_IN>()]

In [None]:
# How many total medals are recorded in the dataset?


In [None]:
#don't use len() -- includes NaN values
len(df['medal'])

#### 5. How many medals were awarded in the first Olympic year (1896)? How would you do this if you didn't know the first year?


In [None]:
# How many medals were awarded in the first Olympic year (1896)?
df[df['year'] == <FILL_THIS_IN>]

#### 6. How many medals has the US won?


In [None]:
# 6. How many medals has the US won?


#### 7. What years were medals awarded for Rugby?


In [None]:
# What years did anyone medal for "Rugby"?


#### 8. What was the first year women were included?

In [None]:
# What was the first year women were included?


## Pandas Cleaning Tasks

In [None]:
df = pd.read_csv('all_olympic_medalists.csv')

In [None]:
df.head()

In [None]:
#show missing values in the 'medal' column


In [None]:
df.fillna({'country': 'Unknown'}, inplace=True) 

In [None]:
df[df['medal'].isnull()]

In [None]:
# remove rows with missing values in the 'medal' column
df.dropna(subset=['medal'], inplace=True)

In [None]:
df[df['medal'].isnull()]

In [None]:
df.columns

In [None]:
df.rename(columns={'event_gender':'gender'}, inplace=True)

In [None]:
df.columns

In [None]:
df['year'] = df['year'].astype(str)

In [None]:
df['year'] = df['year'].astype(int)

In [None]:
df['event'] = df['gender'] + ' ' + df['event_name']

In [None]:
df.head()

In [None]:
#function to convert a four-digit year (XXXX) to a datetime (January 1 of that year)
def convert_year_to_datetime(year):
    return pd.to_datetime(year, format='%Y')

df['year'] = df['year'].apply(convert_year_to_datetime)
df.head()

In [None]:
#change 'year' column back to just the year
df['year'] = df['year'].dt.year

In [None]:
df.head()

In [None]:
us_df = df.groupby('country').get_group('United States').groupby('year').size().sort_values(ascending=False)

In [None]:
pd.set_option('display.max_rows', None)  # Show all rows in the Series

In [None]:
us_df

In [None]:
pd.set_option('display.max_rows', 10) 

# Matplotlib      

In [None]:
import matplotlib.pyplot as plt

In [None]:
#plot the number of medals won by the US over the years
us_df = df.groupby('country').get_group('United States').groupby('year').size()
us_df.plot(kind='bar', figsize=(12, 6), color='green', title='Number of Medals Won by the US')


In [None]:
#plot the number of gold medals won by the US over the years
us_golds = df.groupby('country').get_group('United States').groupby('medal').get_group('Gold').groupby('year').size()
us_golds.plot(kind='bar', figsize=(12, 6), color='gold', title='Number of Gold Medals Awarded to the US')

In [None]:
us_golds.head()

In [None]:
#plot the number of medals won by US in the Olympics split by medal type
df.head()

In [None]:
#filter for US medalists (all genders)
us = df[ (df['country_code'] == 'USA') ]

# Group by year and medal type, then count
medal_counts = us.groupby(['year', 'medal']).size().unstack(fill_value=0)

# Reorder medal columns: Bronze, Silver, Gold
medal_counts = medal_counts[['Bronze', 'Silver', 'Gold']]

# Plot the stacked bar chart
medal_counts.plot(
    kind='bar',
    stacked=True,
    figsize=(12, 6),
    color={'Gold': "#F7D722", 'Silver': '#C0C0C0', 'Bronze': '#CD7F32'}
)

plt.title('US Olympic Medals Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Medals')
plt.legend(title='Medal')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


# Pandas Data Cleaning Exercise

In all questions referring to gross, use the adjusted gross (in 2022 dollars)

- Drop the column 'Ref.'
- Rename 'Adjusted gross (in 2022 dollars)' to 'Adjusted gross'.
- List the artists and the number of times they appear in the rankings.
- What Taylor Swift tours made the rankings?
- Clean the actual gross and adjusted gross columns and convert them to float.
- What is the total adjusted gross from all 20 concerts?
- Calculate average gross/show for each rank using the adjusted gross. 
- How much has Taylor Swift earned from the concerts on this list?

In [None]:
df = pd.read_csv('top-20-womens-tours.csv')

In [None]:
df.columns

In [None]:
df.head()

##### Drop the column 'Ref.'


In [None]:
#- Drop the column 'Ref.'


##### Rename 'Adjusted gross (in 2022 dollars)' to 'Adjusted gross'.


In [None]:
df.columns

In [None]:
#- Rename 'Adjusted gross (in 2022 dollars)' to 'Adjusted gross'.


##### List the artists and the number of times they appear in the rankings.


In [None]:
#- List the artists and the number of times they appear in the rankings, sorted by frequency.


In [None]:
# list artists and number of times they appear in the rankings by alphabetical order


##### What Taylor Swift tours made the rankings?


In [None]:
#- What Taylor Swift tours made the rankings?
taylor = <YOUR_CODE_HERE>
taylor['Tour title']

In [None]:
#is there another way to do this?

##### Clean the actual gross and adjusted gross columns and convert them to float.


In [None]:
# - Clean the actual gross and adjusted gross columns and convert them to float.


def clean_currency(value):
    if isinstance(value, str):
        #remove instances of "$" and ","
        value = <YOUR_CODE_HERE>
        return float(value)
    return value  # leave non-string values (e.g. NaN or numbers) unchanged
<YOUR_CODE_HERE>
<YOUR_CODE_HERE>

df.head()

##### What is the total adjusted gross from all 20 concerts?


In [None]:
#- What is the total adjusted gross from all 20 concerts?
<YOUR_CODE_HERE>

##### Calculate average gross/show for each rank using the adjusted gross. 


In [None]:
#- Calculate average gross/show for each rank using the adjusted gross. 
<YOUR_CODE_HERE>
df

##### How much has Taylor Swift earned from the concerts on this list?


In [None]:
#- How much has Taylor Swift earned from the concerts on this list?
<YOUR_CODE_HERE>

In [None]:
<YOUR_CODE_HERE>


### Merging and Concatenation (pd.merge vs pd.concat) 

Data rarely comes in a single, perfect file. As an AI Technician, you'll constantly be combining data from different sources. Pandas provides two primary ways to do this: `concat` for stacking data on top of each other, and `merge` for joining data based on common columns, similar to a SQL join.
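As a quick illustration of stacking with `concat`, here is a minimal sketch. The two small DataFrames are hypothetical and are not read from the notebook's CSV files:

```python
import pandas as pd

# Hypothetical example data with identical columns
january = pd.DataFrame({'Day': ['January 1', 'January 2'], 'High': [40.0, 38.0]})
february = pd.DataFrame({'Day': ['February 1', 'February 2'], 'High': [45.0, 50.0]})

# concat stacks the rows; ignore_index=True renumbers the index 0..n-1
stacked = pd.concat([january, february], ignore_index=True)
print(stacked)
```

Without `ignore_index=True`, each frame keeps its own labels, so the combined index would read 0, 1, 0, 1.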

# Merge examples


### Exercises (try before peeking at solutions)

1. Use `df_olympics` and `df_continents` (both are created in the merge example further down): left-merge them and compute medal counts per continent.


2. Given two DataFrames with the same columns but different rows, concatenate them and reset the index.


3. Find a case that produces a many-to-many join (duplicate keys on both frames) and observe the result — what happens? (Hints: use `indicator=True` and `validate='one_to_many'` to check behavior.)


# Example DataFrames for merges


In [None]:
df_people = pd.DataFrame({
    'person_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})

df_scores = pd.DataFrame({
    'person_id': [1, 2, 2, 4],
    'score': [95, 80, 82, 77]
})

print("people\n", df_people)
print("\nscores\n", df_scores)



# Inner join (only matching keys)
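A possible inner join using the `df_people` and `df_scores` frames from the example cell (they are re-created here so the snippet runs on its own). The outer join with `indicator=True` is added only for comparison:

```python
import pandas as pd

df_people = pd.DataFrame({
    'person_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
df_scores = pd.DataFrame({
    'person_id': [1, 2, 2, 4],
    'score': [95, 80, 82, 77]
})

# Inner join: keep only person_id values present in both frames (1 and 2)
inner = pd.merge(df_people, df_scores, on='person_id', how='inner')
print(inner)

# Outer join keeps everything; indicator=True adds a '_merge' column
# saying whether each row came from the left frame, the right frame, or both
outer = pd.merge(df_people, df_scores, on='person_id', how='outer', indicator=True)
print(outer)
```

Bob appears twice in the inner result because he has two scores, while Charlie (no score) and person 4 (no name) only show up in the outer result.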


In [None]:
# 1. Merging Data
# First, ensure we have the olympic data loaded
df_olympics = pd.read_csv('all_olympic_medalists.csv')

# Create a small DataFrame to map codes to continents (just a sample)
continent_data = {
    'country_code': ['USA', 'CHN', 'GBR', 'AUS', 'CAN', 'FRA', 'GER', 'JPN'],
    'continent': ['North America', 'Asia', 'Europe', 'Oceania', 'North America', 'Europe', 'Europe', 'Asia']
}
df_continents = pd.DataFrame(continent_data)

# Merge the two DataFrames using 'country_code' as the key
# how='left' keeps all rows from the olympics data, even if we didn't map their continent
df_merged = pd.merge(df_olympics, df_continents, on='country_code', how='left')

# Check the result to see the new 'continent' column
df_merged[df_merged['continent'].notnull()].head()

In [None]:
# 2. Pivot Tables
# We want to see Countries as Rows, Years as Columns, and the Count of Medals as values
pivot_medals = df_olympics.pivot_table(
    index='country', 
    columns='year', 
    values='medal', 
    aggfunc='count', 
    fill_value=0
)

# Show a subset (first 5 countries, last 5 Olympic years)
pivot_medals.iloc[:5, -5:]
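To see what `pivot_table` does on something small enough to check by hand, here is a sketch on made-up rows shaped like the medal data (one row per medal). The country names "A" and "B" are placeholders:

```python
import pandas as pd

# Hypothetical rows: country A has 2 medals in 2000 and 1 in 2004; B has 1 in 2000
medals = pd.DataFrame({
    'country': ['A', 'A', 'B', 'A'],
    'year':    [2000, 2000, 2000, 2004],
    'medal':   ['Gold', 'Silver', 'Gold', 'Bronze'],
})

counts = medals.pivot_table(index='country', columns='year',
                            values='medal', aggfunc='count', fill_value=0)
print(counts)
```

Country B has no rows for 2004, so that cell would be empty; `fill_value=0` turns it into a zero.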

In [None]:
# 3. Time Series Handling
df_weather = pd.read_csv('pittsburgh-weather-2024.csv')

# The 'Day' column is just "Month Day" (e.g., "January 1"). 
# We need to add the year and convert to datetime.
df_weather['Date'] = pd.to_datetime(df_weather['Day'] + ', 2024', format='%B %d, %Y')

# Set the new Date column as the index
df_weather.set_index('Date', inplace=True)

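# 4. Resampling
# 'MS' stands for month start: rows are grouped by calendar month and we take the mean of each.
# (The older month-end alias 'M' is deprecated since pandas 2.2 in favor of 'ME'.)
monthly_temps = df_weather['High'].resample('MS').mean()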

monthly_temps
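Resampling can be checked on synthetic data where the answer is known in advance. In this sketch the readings are invented (30.0 every day in January, 40.0 every day in February), and the `'MS'` (month start) alias is used because it is accepted by both older and newer pandas versions:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for January and February 2024
dates = pd.date_range('2024-01-01', '2024-02-29', freq='D')
highs = pd.Series(np.where(dates.month == 1, 30.0, 40.0), index=dates)

# Group by calendar month; each group is labelled with the month's first day
monthly = highs.resample('MS').mean()
print(monthly)
```

Resampling requires a datetime index, which is why the weather example converts the `Day` column with `pd.to_datetime` and calls `set_index` first.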

In [None]:
# 5. Visualization
# Plotting the resampled time series data
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
monthly_temps.plot(marker='o', color='orange', linestyle='-')

plt.title('Average Monthly High Temperature in Pittsburgh (2024)')
plt.ylabel('Temperature (°F)')
plt.xlabel('Month')
plt.grid(True)
plt.show()