## Web scraping the NOAA website

We scrape global temperature anomaly and precipitation data from ncei.noaa.gov.
Visit the website [here](https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/service-api).
For this project we work with the Caribbean Islands. According to NOAA, the **Caribbean Region** includes the following islands and countries:
- Antigua and Barbuda
- The Bahamas
- Barbados
- Belize
- Cuba
- Dominica
- Dominican Republic
- Grenada
- Guyana
- Haiti
- Jamaica
- Saint Kitts and Nevis
- Saint Lucia
- Saint Vincent and the Grenadines
- Suriname
- Trinidad and Tobago
- U.S. territories: Puerto Rico and the U.S. Virgin Islands 

Precipitation data became available later than temperature anomaly data, so to keep the dataset complete we download the date range that matches the precipitation data (1980-2025).

**Author:** Nazgul Sagatova  
**Last updated:** 2025-11-22 

### Import necessary libraries

In [1]:
import requests
import io
import pandas as pd
import numpy as np
import seaborn as sns

### Temperature Anomaly & Precipitation download

We will get data related to **Temperature Anomaly** and **Precipitation** for the Caribbean Region in one go.

To request a specific climate variable and region, we pass parameters in a GET request. They are encoded as path segments of the URL, optionally followed by query parameters.

In [3]:

# Download all data (tavg + pcp) in one loop. The result is a dictionary of DataFrames,
# one for temperature anomaly and one for precipitation.

region = 'caribbeanIslands'
parameters = ['tavg', 'pcp']  # a list keeps the download order fixed
surface = 'land_ocean'
timescale = 'ytd'
month = 0
file_format = 'csv'  # renamed to avoid shadowing the built-in format()
begYear = 1980
endYear = 2025

# Number of preamble lines before the CSV header row, per parameter
skiprows_by_parameter = {'tavg': 3, 'pcp': 2}

d = {name: pd.DataFrame() for name in parameters}

for parameter in parameters:
    print(parameter)

    url = f'https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/{region}/{parameter}/{surface}/{timescale}/{month}/{begYear}-{endYear}/data.{file_format}'
    print(url)

    response = requests.get(url, timeout=30)

    if response.status_code == 200:
        print("Success!")
        skiprows = skiprows_by_parameter[parameter]
        print(f'skipping {skiprows} rows')
        # Parse the body we already downloaded instead of fetching the URL a second time
        d[parameter] = pd.read_csv(io.StringIO(response.text), skiprows=skiprows)
        print(d[parameter].head())
    elif response.status_code == 404:
        print("Not Found.")
    else:
        print(f"Unexpected status code: {response.status_code}")


tavg
https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/caribbeanIslands/tavg/land_ocean/ytd/0/1980-2025/data.csv
Success!
skipping 3 rows
     Date  Anomaly
0  198001     0.29
1  198002     0.31
2  198003     0.28
3  198004     0.29
4  198005     0.35
pcp
https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series/caribbeanIslands/pcp/land_ocean/ytd/0/1980-2025/data.csv
Success!
skipping 2 rows
     Date   Value
0  198001   44.59
1  198002   79.07
2  198003   99.58
3  198004  155.65
4  198005  256.91
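
The number of preamble lines differs between the two files (3 for `tavg`, 2 for `pcp`), so hard-coding `skiprows` is fragile. As a minimal sketch, assuming the header row always starts with `Date` as in the output above, the header could be located automatically. The helper name `read_noaa_csv` and the preamble text in `sample` are hypothetical and only mimic the layout.

```python
import io
import pandas as pd

def read_noaa_csv(text: str, header_prefix: str = "Date") -> pd.DataFrame:
    """Parse CSV text, dropping any preamble lines before the header row."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(header_prefix):
            # Keep only the header row and everything after it
            return pd.read_csv(io.StringIO("\n".join(lines[i:])))
    raise ValueError(f"No header row starting with {header_prefix!r} found")

# Made-up sample: two placeholder preamble lines, then a header and two rows
sample = (
    "Placeholder title line\n"
    "Placeholder units line\n"
    "Date,Value\n"
    "198001,44.59\n"
    "198002,79.07\n"
)
sample_df = read_noaa_csv(sample)
print(sample_df)
```

In the download loop this would be called as `read_noaa_csv(response.text)`, provided the real files follow the same layout.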


In [4]:
# Merge both temperature anomaly and precipitation in one dataframe, match by date.

print('Merging all the data...')
df = pd.merge(d['tavg'], d['pcp'], how='left', on='Date')
print('Success')

Merging all the data...
Success
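
A left merge silently leaves `NaN` where a date has no precipitation row, and repeated dates would multiply rows. The sketch below shows how pandas' `validate` and `indicator` options can surface both problems; the toy frames reuse the column names from the download, but their contents are illustrative only.

```python
import pandas as pd

# Toy frames; the third date deliberately has no precipitation row
tavg_toy = pd.DataFrame({"Date": [198001, 198002, 198003], "Anomaly": [0.29, 0.31, 0.28]})
pcp_toy = pd.DataFrame({"Date": [198001, 198002], "Value": [44.59, 79.07]})

merged_toy = pd.merge(
    tavg_toy,
    pcp_toy,
    how="left",
    on="Date",
    validate="one_to_one",  # raises MergeError if a Date repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

unmatched = merged_toy[merged_toy["_merge"] == "left_only"]
print(f"Rows without precipitation: {len(unmatched)}")
```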


### Convert date to datetime object

The raw year and month need to be converted into datetime format. The Date column is, luckily, stored as int64 values in YYYYMM form, so `pd.to_datetime` can parse it directly with the `%Y%m` format.
Separate Year and Month columns are not needed for the conversion itself; they are derived later from the datetime index.

In [5]:
# Convert entire column
df['date'] = pd.to_datetime(df['Date'], format='%Y%m')

# Set as index
df = df.set_index('date')

#remove the old Date format
df.drop('Date', axis='columns', inplace=True)
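
For comparison, the same conversion can be done with integer arithmetic on the YYYYMM values. This is a sketch on a toy frame, not the approach used in the cell above; the `toy` frame and its values are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"Date": [198001, 198002, 202510]})  # YYYYMM integers

year = toy["Date"] // 100   # 198001 -> 1980
month = toy["Date"] % 100   # 198001 -> 1

# to_datetime can assemble dates from a frame with year/month/day columns
toy["date"] = pd.to_datetime(pd.DataFrame({"year": year, "month": month, "day": 1}))
print(toy)
```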


### Rename column names to something meaningful

- Anomaly -> temp_anomaly
- Value -> precip

In [6]:
#rename the column names
df.rename(columns={"Value" : "precip", "Anomaly" : "temp_anomaly"}, inplace=True)


In [7]:
df.head()

            temp_anomaly  precip
date
1980-01-01          0.29   44.59
1980-02-01          0.31   79.07
1980-03-01          0.28   99.58
1980-04-01          0.29  155.65
1980-05-01          0.35  256.91


## Data Structure and Overview
In this section we examine the dataset: its shape, columns, data types, and time span.

In [8]:
print(f'The shape of the dataset = {df.shape}')
print(f'The dataset contains {df.columns}')

The shape of the dataset = (550, 2)
The dataset contains Index(['temp_anomaly', 'precip'], dtype='object')


In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.sample(5)

In [None]:
df.info()

In [None]:
df.describe(include='all')

### Missing Values and Data Quality

In this section we check for null or missing entries. In time series data, gaps may indicate a problem with the data-gathering device, e.g. a sensor. We also assess data completeness and look for inconsistencies such as negative precipitation.

In [9]:
# Missing values
df.isnull().sum()

temp_anomaly    0
precip          0
dtype: int64

In [11]:
df.isna().sum()

temp_anomaly    0
precip          0
dtype: int64
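
The null counts above only cover rows that exist; a month that is absent from the index altogether would not show up. A minimal sketch of a gap check compares the index with a complete month-start range. The helper name `missing_months` and the toy index are illustrative.

```python
import pandas as pd

def missing_months(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Return month-start dates between the first and last entry that are absent."""
    expected = pd.date_range(index.min(), index.max(), freq="MS")
    return expected.difference(index)

# Toy index with March 1980 deliberately left out
toy_index = pd.to_datetime(["1980-01-01", "1980-02-01", "1980-04-01"])
gaps = missing_months(toy_index)
print(gaps)
```

Applied to the real data, `missing_months(df.index)` should be empty if the series is continuous. A full range from January 1980 to October 2025 contains 550 month starts, which matches the row count reported earlier.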

### Time Span Analysis

Identify the dates when
- the min/max temperature anomaly was recorded
- the min/max precipitation amount was recorded

For these purposes, let's create separate columns for year and month

In [12]:
df['year'] = df.index.year
df['month'] = df.index.month

In [13]:
df.head()

            temp_anomaly  precip  year  month
date
1980-01-01          0.29   44.59  1980      1
1980-02-01          0.31   79.07  1980      2
1980-03-01          0.28   99.58  1980      3
1980-04-01          0.29  155.65  1980      4
1980-05-01          0.35  256.91  1980      5


In [14]:
print(f"The first date the data was recorded is {df.index.min()}")
print(f"The most recent date the data was recorded is {df.index.max()}")

The first date the data was recorded is 1980-01-01 00:00:00
The most recent date the data was recorded is 2025-10-01 00:00:00
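
The dates of the minimum and maximum values listed at the start of this section can be looked up with `idxmin` and `idxmax`, which return the index label (here, the date) of each extreme. The sketch below runs on a toy frame with the same column names; the values are illustrative, so the dates it prints say nothing about the real data.

```python
import pandas as pd

toy = pd.DataFrame(
    {"temp_anomaly": [0.29, 0.31, 0.28], "precip": [44.59, 79.07, 99.58]},
    index=pd.to_datetime(["1980-01-01", "1980-02-01", "1980-03-01"]),
)

# Restrict to the measurement columns so helper columns like year/month are ignored
cols = ["temp_anomaly", "precip"]
extremes = pd.DataFrame({
    "min_date": toy[cols].idxmin(),
    "min_value": toy[cols].min(),
    "max_date": toy[cols].idxmax(),
    "max_value": toy[cols].max(),
})
print(extremes)
```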


### Identify invalid values (outliers)
We need to check whether the dataset contains invalid data, such as negative precipitation or implausibly large temperature anomalies (here, beyond ±10).

In [15]:
# Identify invalid values such as very high temperature anomalies and negative precipitation

invalid_temp = (df['temp_anomaly'] < -10) | (df['temp_anomaly'] > 10)
invalid_precip = (df['precip'] < 0)
print(f"Invalid temp anomaly: {invalid_temp.sum()}, Invalid precip: {invalid_precip.sum()}")

Invalid temp anomaly: 0, Invalid precip: 0
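
The fixed thresholds above only catch physically implausible values. A data-driven rule is a different technique and is shown here only as a possible complement: the common 1.5 × IQR rule, sketched on a toy series. The helper name `iqr_outliers`, the multiplier `k`, and the values are assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
flags = iqr_outliers(toy)
print(f"Flagged values: {toy[flags].tolist()}")
```

Values flagged this way are statistical outliers, not necessarily invalid measurements, so they would need inspection before any removal.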


### Identify duplicate records

In [16]:
# Find duplicate records

dups = df.duplicated(subset=['year', 'month'], keep=False)
print(f"Duplicate rows: {dups.sum()}")

Duplicate rows: 0


In [18]:
df.head()

            temp_anomaly  precip  year  month
date
1980-01-01          0.29   44.59  1980      1
1980-02-01          0.31   79.07  1980      2
1980-03-01          0.28   99.58  1980      3
1980-04-01          0.29  155.65  1980      4
1980-05-01          0.35  256.91  1980      5


In [19]:
# Final save — everyone will read this file
df.to_parquet("../data/processed/caribbean_temp_precip_1980_2025.parquet", index=True)
print("Processed dataset saved → ready for EDA notebooks")

Processed dataset saved → ready for EDA notebooks


## Initial Analysis Insights

- There are 550 observations, i.e. 550 months of temperature anomaly and precipitation data.
- The dataset is complete, i.e. it has no null or missing values.
- The first record is from January 1980.
- The last record is from October 2025.
- No duplicate records were identified.
- No invalid values, such as negative precipitation or extreme anomalies, were identified.
- Precipitation values are much larger in magnitude than temperature anomalies, so a log transform will be needed during feature engineering.
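
As a rough sketch of the log transform mentioned in the last point, `np.log1p` computes log(1 + x), which stays defined if a precipitation value is exactly zero. The toy frame and the column name `precip_log` are illustrative, and the zero value is made up to show that case.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"precip": [0.0, 44.59, 256.91]})

# log1p(x) = log(1 + x); maps 0 to 0 and compresses large values
toy["precip_log"] = np.log1p(toy["precip"])
print(toy)
```

The transform can be reversed with `np.expm1` when values need to be reported in the original units.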