# Week 1 - Day 3 Lab: Data & Matrix Manipulation
In this lab, you'll work with a realistic weather dataset. You'll use **Pandas** to explore and clean the data, and **NumPy** to perform matrix operations.

**Dataset:** `hourly_weather_10_days.csv` (10 days of hourly weather data)

## Step 1: Load the Data
- Use Pandas to load the CSV file
- Display the first few rows
- Check the number of rows and columns

In [1]:
# TODO: Load the data into a DataFrame
import pandas as pd

# Replace the file path if needed
df = pd.read_csv('hourly_weather_10_days.csv')
df.head()

             timestamp  temperature_C  humidity_%  wind_speed_kmph  pressure_hPa  visibility_km
0  2023-03-01 00:00:00           16.6        74.4              5.7        1012.5            9.5
1  2023-03-01 01:00:00           16.2        78.5              5.0        1012.1           10.3
2  2023-03-01 02:00:00           15.3        73.3              4.7           NaN           11.1
3  2023-03-01 03:00:00           15.8        72.4              1.3        1005.0            8.9
4  2023-03-01 04:00:00           20.9        70.6              6.8        1016.3            9.8
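
Step 1 also asks for the number of rows and columns, which `df.shape` reports. A minimal sketch, using an in-memory copy of the five rows shown above in place of the real file (the `io.StringIO` stand-in and the `sample` name are assumptions for illustration):

```python
import io
import pandas as pd

# Stand-in for hourly_weather_10_days.csv: the five rows displayed above
csv_text = """timestamp,temperature_C,humidity_%,wind_speed_kmph,pressure_hPa,visibility_km
2023-03-01 00:00:00,16.6,74.4,5.7,1012.5,9.5
2023-03-01 01:00:00,16.2,78.5,5.0,1012.1,10.3
2023-03-01 02:00:00,15.3,73.3,4.7,,11.1
2023-03-01 03:00:00,15.8,72.4,1.3,1005.0,8.9
2023-03-01 04:00:00,20.9,70.6,6.8,1016.3,9.8
"""

# parse_dates converts the timestamp column to datetime64 while loading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

n_rows, n_cols = sample.shape  # (rows, columns)
print(n_rows, n_cols)
print(sample.dtypes)
```

On the full file, `df.shape` should agree with the 240 entries and 6 columns that `df.info()` reports in Step 2.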


## Step 2: Basic Exploration
- Check column names and data types
- Display basic statistics using `.describe()`
- Count missing values in each column

In [2]:
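# TODO: Explore the DataFrame
print(df.info())        # info() prints its own report and returns None, hence the "None" in the output
print(df.describe())    # summary statistics for the numeric columns
print(df.isna().sum())  # number of missing values per column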

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   timestamp        240 non-null    object 
 1   temperature_C    228 non-null    float64
 2   humidity_%       224 non-null    float64
 3   wind_speed_kmph  226 non-null    float64
 4   pressure_hPa     223 non-null    float64
 5   visibility_km    228 non-null    float64
dtypes: float64(5), object(1)
memory usage: 11.4+ KB
None
       temperature_C  humidity_%  wind_speed_kmph  pressure_hPa  visibility_km
count     228.000000  224.000000       226.000000    223.000000     228.000000
mean       21.315789   66.795982        10.105310   1011.884753       9.989474
std         3.421237    8.190300         3.940668      5.187080       1.022166
min        11.500000   47.800000         1.300000    998.100000       6.800000
25%        18.700000   61.075000         6.625000   1008.900000       9.275

## Step 3: Handle Missing Values
- Drop or fill missing values
- Justify your approach (e.g., fill with mean, forward fill, etc.)

In [3]:
# TODO: Fill missing values
# Example: df['column'] = df['column'].fillna(df['column'].mean())

# Fill in your logic here

In [21]:
# Answer 3
# Fill gaps in humidity with the column mean; the other columns are left unchanged
df['humidity_%'] = df['humidity_%'].fillna(df['humidity_%'].mean())
df.head()

             timestamp  temperature_C  humidity_%  wind_speed_kmph  pressure_hPa  visibility_km
0  2023-03-01 00:00:00           16.6        74.4              5.7        1012.5            9.5
1  2023-03-01 01:00:00           16.2        78.5              5.0        1012.1           10.3
2  2023-03-01 02:00:00           15.3        73.3              4.7           NaN           11.1
3  2023-03-01 03:00:00           15.8        72.4              1.3        1005.0            8.9
4  2023-03-01 04:00:00           20.9        70.6              6.8        1016.3            9.8
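
The answer above fills only `humidity_%`, so the other columns still hold gaps (row 2 of `pressure_hPa`, for example). One possible way to cover every numeric column at once, sketched on a small made-up frame (the values and the `toy` name are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "temperature_C": [16.0, np.nan, 18.0, 20.0],
    "pressure_hPa": [1012.0, 1010.0, np.nan, 1008.0],
})

num_cols = toy.select_dtypes(include="number").columns

# Option 1: replace each gap with that column's mean
mean_filled = toy.copy()
mean_filled[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Option 2: linear interpolation between neighbouring readings
interp_filled = toy.copy()
interp_filled[num_cols] = toy[num_cols].interpolate(
    method="linear", limit_direction="both"
)

print(mean_filled)
print(interp_filled)
```

Which option to justify depends on the data: a column mean ignores time of day, while interpolation assumes consecutive hourly readings change smoothly.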


## Step 4: Data Analysis
- Calculate daily average temperature
- Find max, min, mean for each metric
- Which hour of the day is the most humid on average?

In [None]:
# TODO: Perform analysis
# Use groupby, aggregation, and filtering functions
# Placeholder example:
# df['timestamp'] = pd.to_datetime(df['timestamp'])
# df['hour'] = df['timestamp'].dt.hour
# avg_humidity_by_hour = df.groupby('hour')['humidity_%'].mean()

In [8]:
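import pandas as pd
import numpy as np

# Parse the timestamp strings so the .dt accessor is available
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Group by hour of day (0-23) and average the humidity across all days
df['hour'] = df['timestamp'].dt.hour
avg_humidity_by_hour = df.groupby('hour')['humidity_%'].mean()
print(avg_humidity_by_hour)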


hour
0     78.170000
1     78.420000
2     75.414286
3     71.940000
4     69.310000
5     68.611111
6     65.770000
7     65.044444
8     63.490000
9     59.650000
10    58.710000
11    58.910000
12    59.422222
13    58.330000
14    61.366667
15    60.888889
16    59.600000
17    64.030000
18    66.971429
19    69.190000
20    67.488889
21    72.700000
22    73.750000
23    79.100000
Name: humidity_%, dtype: float64
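
In the output above, hour 23 has the highest average humidity (79.1), which `idxmax()` can report directly. The other two tasks in this step, the daily average temperature and the max/min/mean of each metric, could be sketched as follows on a small made-up frame (the values and the `toy` name are illustrative assumptions):

```python
import pandas as pd

# Made-up readings: two days, two hours each
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-03-01 00:00", "2023-03-01 12:00",
        "2023-03-02 00:00", "2023-03-02 12:00",
    ]),
    "temperature_C": [16.0, 24.0, 14.0, 22.0],
    "humidity_%": [80.0, 60.0, 78.0, 58.0],
})

# Daily average temperature: group by calendar date
daily_avg_temp = toy.groupby(toy["timestamp"].dt.date)["temperature_C"].mean()

# Max, min, and mean for each metric in one call
metric_stats = toy[["temperature_C", "humidity_%"]].agg(["max", "min", "mean"])

# Most humid hour on average: group by hour, then take the label of the maximum
avg_humidity_by_hour = toy.groupby(toy["timestamp"].dt.hour)["humidity_%"].mean()
most_humid_hour = avg_humidity_by_hour.idxmax()

print(daily_avg_temp)
print(metric_stats)
print(most_humid_hour)
```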


## Step 5: NumPy Matrix Exercises
Convert relevant DataFrame columns into NumPy arrays and perform matrix operations.

In [None]:
# TODO: Extract temperature and wind_speed as NumPy arrays
import numpy as np

temp = df['temperature_C'].to_numpy()
wind = df['wind_speed_kmph'].to_numpy()

In [9]:
#Answer 5 
import numpy as np

temp = df['temperature_C'].to_numpy()
wind = df['wind_speed_kmph'].to_numpy()
print(temp)
print(wind)


[16.6 16.2 15.3 15.8 20.9 20.8 22.8 22.5 21.2 28.2  nan 25.6 25.4 23.4
 27.6 23.8 22.  23.7 23.1 22.1 18.2 19.4 19.7 14.7 15.9 15.7 17.6 20.
 22.  21.2 22.7 21.5 23.6 28.7 20.5 25.1 22.1 24.5 22.6 21.9 23.9  nan
 24.2 22.  19.9 18.8 16.5 18.  13.6 15.7 18.5 15.9 25.7 20.9 20.8 20.7
 22.3 25.7 24.8 21.2 25.5 21.9 22.4 24.7 24.4  nan 23.3 24.4 22.7 20.9
 18.6 15.4 16.7 18.7 17.9 16.  21.5 22.1 27.1 23.7 21.9 24.2 26.8 24.4
  nan 25.7 24.  19.5 23.  23.7 19.8 21.1 18.7 20.8 19.7 15.9 19.7 12.4
 18.7 18.9 21.2 21.  19.6 20.5 24.7  nan 23.3 23.1 22.8 24.5 24.9 24.7
 24.1 22.5 24.2 23.3 22.2 23.9 16.8 18.4 17.8 18.5 21.1 21.9 17.9 16.6
 23.3 24.8 22.5 26.  25.4 24.3 26.2 20.8 21.3 22.7 24.7 23.5 25.6 23.2
 23.9 15.5 19.8 17.3 15.8 17.6 17.9  nan 19.8 23.4 19.9 25.7 23.1 25.9
 24.4 22.5 24.8 24.5 24.2 22.9 20.  20.7 20.5 21.9 23.2 15.4 15.3 17.6
 13.5  nan 17.4 14.8 21.2 18.5 21.6 23.4 26.  22.3  nan 24.9 23.7  nan
 24.6 23.6 25.5 24.4 22.5 21.5 19.  17.3 15.2 17.9 14.3 17.3 15.6 21.5
 17.8 2

### a) Reshape into matrix form
- Assume each row is a day
- Reshape temperature into a (10, 24) matrix
- Calculate daily min, max, and mean using axis-based operations
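
The temperature array printed in Step 5 contains `nan` entries, so plain `min`/`max`/`mean` would return `nan` for any day with a gap. A small sketch of the NaN-aware, axis-based alternative on a made-up (2, 3) matrix (the values are illustrative only):

```python
import numpy as np

# Made-up matrix: 2 "days" x 3 "hours", with one missing reading
m = np.array([[10.0, np.nan, 14.0],
              [20.0, 22.0, 24.0]])

# axis=1 reduces across the columns, giving one value per row (per day)
daily_min = np.nanmin(m, axis=1)
daily_max = np.nanmax(m, axis=1)
daily_mean = np.nanmean(m, axis=1)

print(daily_min)
print(daily_max)
print(daily_mean)
print(m.mean(axis=1))  # the plain mean propagates the NaN in row 0
```

Note that `reshape((10, 24))` only works when the array has exactly 240 elements, one per hour over 10 days.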

In [None]:
# TODO: Reshape and aggregate
# Hint: temp_matrix = temp.reshape((10, 24))
# Write functions to find min, max, mean across rows

In [None]:
import numpy as np

# Each row of the (10, 24) matrix holds one day of hourly temperatures
temp_matrix = temp.reshape((10, 24))

# NaN-aware reductions along axis=1 return one value per row (day)
def row_min(matrix):
    return np.nanmin(matrix, axis=1)

def row_max(matrix):
    return np.nanmax(matrix, axis=1)

def row_mean(matrix):
    return np.nanmean(matrix, axis=1)

print(temp_matrix)

[[16.6 16.2 15.3 15.8 20.9 20.8 22.8 22.5 21.2 28.2  nan 25.6 25.4 23.4
  27.6 23.8 22.  23.7 23.1 22.1 18.2 19.4 19.7 14.7]
 [15.9 15.7 17.6 20.  22.  21.2 22.7 21.5 23.6 28.7 20.5 25.1 22.1 24.5
  22.6 21.9 23.9  nan 24.2 22.  19.9 18.8 16.5 18. ]
 [13.6 15.7 18.5 15.9 25.7 20.9 20.8 20.7 22.3 25.7 24.8 21.2 25.5 21.9
  22.4 24.7 24.4  nan 23.3 24.4 22.7 20.9 18.6 15.4]
 [16.7 18.7 17.9 16.  21.5 22.1 27.1 23.7 21.9 24.2 26.8 24.4  nan 25.7
  24.  19.5 23.  23.7 19.8 21.1 18.7 20.8 19.7 15.9]
 [19.7 12.4 18.7 18.9 21.2 21.  19.6 20.5 24.7  nan 23.3 23.1 22.8 24.5
  24.9 24.7 24.1 22.5 24.2 23.3 22.2 23.9 16.8 18.4]
 [17.8 18.5 21.1 21.9 17.9 16.6 23.3 24.8 22.5 26.  25.4 24.3 26.2 20.8
  21.3 22.7 24.7 23.5 25.6 23.2 23.9 15.5 19.8 17.3]
 [15.8 17.6 17.9  nan 19.8 23.4 19.9 25.7 23.1 25.9 24.4 22.5 24.8 24.5
  24.2 22.9 20.  20.7 20.5 21.9 23.2 15.4 15.3 17.6]
 [13.5  nan 17.4 14.8 21.2 18.5 21.6 23.4 26.  22.3  nan 24.9 23.7  nan
  24.6 23.6 25.5 24.4 22.5 21.5 19.  17.3 15.2 17.9]


### b) Normalize the temperature matrix
- Subtract the mean and divide by std deviation
- Do it manually using NumPy functions
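
A minimal sketch of the z-score normalization this step describes, assuming missing readings should be ignored when computing the mean and standard deviation. The function name `normalize_zscore` and the demo values are hypothetical:

```python
import numpy as np

def normalize_zscore(matrix):
    """Subtract the overall mean and divide by the overall std, ignoring NaN."""
    mean = np.nanmean(matrix)
    std = np.nanstd(matrix)  # population std (ddof=0), NumPy's default
    return (matrix - mean) / std

# Made-up matrix with one missing reading
m = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 5.0]])
z = normalize_zscore(m)
print(z)
```

After this transformation the non-missing values have mean 0 and standard deviation 1, and missing entries stay `nan`. To normalize each day separately instead, the same functions accept `axis=1, keepdims=True`.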

In [None]:
# TODO: Normalize temp_matrix
# Placeholder for function: def normalize(matrix):
# return ...

# Apply it to temp_matrix

In [12]:
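# Note: this is min-max scaling to the range [0, 1], applied per column
def normalize(matrix):
    return (matrix - matrix.min()) / (matrix.max() - matrix.min())

# Note: temp_matrix is rebound here to a DataFrame of every non-timestamp column
temp_matrix = df.drop(columns=['timestamp'])
normalized_temp_matrix = normalize(temp_matrix)
print(normalized_temp_matrix, temp_matrix)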

     temperature_C  humidity_%  wind_speed_kmph  pressure_hPa  visibility_km  \
0         0.296512    0.660050         0.266667      0.498270       0.465517   
1         0.273256    0.761787         0.224242      0.484429       0.603448   
2         0.220930    0.632754         0.206061           NaN       0.741379   
3         0.250000    0.610422         0.000000      0.238754       0.362069   
4         0.546512    0.565757         0.333333      0.629758       0.517241   
..             ...         ...              ...           ...            ...   
235       0.691860    0.481390         0.521212      0.650519       0.241379   
236       0.505814         NaN         0.418182      0.539792       0.603448   
237            NaN    0.585608         0.193939      0.124567       0.672414   
238       0.546512    0.774194         0.224242      0.678201       0.362069   
239       0.000000         NaN         0.363636           NaN       0.982759   

         hour  
0    0.000000  
1    0.

### c) Apply custom mask/filter
- Create a mask for wind speed > 15 kmph
- Use it to extract high-wind readings
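
A short sketch of boolean masking on a plain NumPy array, using made-up wind readings (the values are illustrative only). It also shows how to recover the positions and the count of matching readings:

```python
import numpy as np

# Made-up wind speeds in kmph, including one missing reading
wind = np.array([5.7, 17.6, np.nan, 16.0, 12.3])

mask = wind > 15                  # comparisons with NaN evaluate to False
high_wind = wind[mask]            # values where the mask is True
positions = np.flatnonzero(mask)  # integer positions of those values

print(high_wind)
print(positions)
print(mask.sum())                 # number of high-wind readings
```

Because `nan > 15` is `False`, missing readings are dropped by the mask without any extra handling.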

In [None]:
# TODO: Create boolean mask and filter wind speeds
# mask = wind > 15
# high_wind = wind[mask]

In [18]:
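# Answer (c)
import pandas as pd

df = pd.read_csv('hourly_weather_10_days.csv')
wind = df['wind_speed_kmph']   # a pandas Series, so the result keeps the original row labels
mask = wind > 15               # boolean mask; missing values compare as False
high_wind = wind[mask]
print(high_wind)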

11     17.6
12     16.0
13     16.5
34     16.3
57     16.7
58     15.8
59     17.8
60     15.1
63     16.3
64     15.2
83     17.0
84     15.9
105    15.6
107    15.8
108    15.4
110    15.6
133    16.3
154    15.3
155    16.2
178    16.9
180    15.3
181    15.2
201    15.5
202    17.4
205    17.4
208    15.4
226    15.4
227    16.5
228    17.0
229    15.7
Name: wind_speed_kmph, dtype: float64


## Final Challenge: Write Your Own Function
Write a function `daily_summary(matrix)` that takes a NumPy matrix of shape (10, 24) and returns a summary dictionary for each day.

In [None]:
# TODO: Write and test your function
def daily_summary(matrix):
    # return list of dicts with min, max, mean
    pass

# Example usage:
# summaries = daily_summary(temp_matrix)
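
One possible shape for `daily_summary`, offered as a sketch rather than the expected solution. It assumes missing readings should be skipped, and the `day` key (1-based row number) is an addition not required by the task:

```python
import numpy as np

def daily_summary(matrix):
    """Return one dict per row (day) with NaN-aware min, max, and mean."""
    matrix = np.asarray(matrix, dtype=float)
    mins = np.nanmin(matrix, axis=1)
    maxs = np.nanmax(matrix, axis=1)
    means = np.nanmean(matrix, axis=1)
    return [
        {"day": i + 1, "min": float(lo), "max": float(hi), "mean": float(avg)}
        for i, (lo, hi, avg) in enumerate(zip(mins, maxs, means))
    ]

# Made-up (2, 3) matrix standing in for the (10, 24) temperature matrix
demo = np.array([[10.0, np.nan, 14.0],
                 [20.0, 22.0, 24.0]])
summaries = daily_summary(demo)
print(summaries)
```

With the lab data, the call would be `daily_summary(temp.reshape((10, 24)))`, where `temp` is the NumPy array extracted in Step 5.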

In [22]:
# Answer final
import pandas as pd

df = pd.read_csv('hourly_weather_10_days.csv')
# Calendar date of each reading, usable as a per-day grouping key
df['date'] = pd.to_datetime(df['timestamp']).dt.date
temp_matrix = df.drop(columns=['timestamp', 'date'])
print(temp_matrix)
print(df['date'])




     temperature_C  humidity_%  wind_speed_kmph  pressure_hPa  visibility_km
0             16.6        74.4              5.7        1012.5            9.5
1             16.2        78.5              5.0        1012.1           10.3
2             15.3        73.3              4.7           NaN           11.1
3             15.8        72.4              1.3        1005.0            8.9
4             20.9        70.6              6.8        1016.3            9.8
..             ...         ...              ...           ...            ...
235           23.4        67.2              9.9        1016.9            8.2
236           20.2         NaN              8.2        1013.7           10.3
237            NaN        71.4              4.5        1001.7           10.7
238           20.9        79.0              5.0        1017.7            8.9
239           11.5         NaN              7.3           NaN           12.5

[240 rows x 5 columns]
0      2023-03-01
1      2023-03-01
2      2023-03-0

## ✅ Submit your notebook once complete.
- Add comments where necessary