### Importing Necessary Libraries

In [1]:
import pandas as pd  # Import the pandas library for data analysis and manipulation

### Loading Our Data

In [2]:
DATA_PATH = "../data/all-landsat-data.csv"  # File path to the Landsat data CSV file

# Read the CSV file into a DataFrame and parse the 'date' column as datetime
df = pd.read_csv(DATA_PATH, parse_dates=["date"])
df.head()  # Display the first five rows of the DataFrame

|   | date | NDVI_mean | NDVI_std | NDWI_mean | NDWI_std | NDBI_mean | NDBI_std | LST_mean_C | LST_std_C | count | year | scene |
|---|------|-----------|----------|-----------|----------|-----------|----------|------------|-----------|-------|------|-------|
| 0 | 2022-01-27 | 0.201626 | 0.077391 | 0.055569 | 0.065216 | -0.055569 | 0.065216 | 12.678688 | 4.144818 | 2649889 | 2022 | January |
| 1 | 2022-02-12 | 0.183799 | 0.0994 | 0.056023 | 0.066603 | -0.056023 | 0.066603 | 13.168473 | 4.372898 | 2649889 | 2022 | Feburary |
| 2 | 2022-03-08 | 0.214993 | 0.07053 | 0.039716 | 0.068622 | -0.039716 | 0.068622 | 25.446893 | 3.893588 | 2649889 | 2022 | March |
| 3 | 2022-04-17 | 0.243295 | 0.07634 | 0.059367 | 0.078458 | -0.059367 | 0.078458 | 31.615345 | 4.400812 | 2649889 | 2022 | April |
| 4 | 2022-05-27 | 0.289823 | 0.087954 | 0.107981 | 0.066525 | -0.107981 | 0.066525 | 27.281522 | 4.171977 | 2649889 | 2022 | May |


### Basic Analysis

In [3]:
# Columns of our dataset
print("📊 Columns in the dataset:")
print("-" * 40)
print(df.columns.to_list())  # Convert column names to a list for cleaner display

📊 Columns in the dataset:
----------------------------------------
['date', 'NDVI_mean', 'NDVI_std', 'NDWI_mean', 'NDWI_std', 'NDBI_mean', 'NDBI_std', 'LST_mean_C', 'LST_std_C', 'count', 'year', 'scene']


In [4]:
# Shape of our dataset
print("📐 Shape of the dataset:")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

📐 Shape of the dataset:
Rows: 24, Columns: 12


In [5]:
# 💾 Checking the size of the dataset
# Calculate total memory usage in bytes
mem_bytes = df.memory_usage(deep=True).sum()

# Convert bytes to megabytes (MB)
mem_mb = mem_bytes / (1024**2)

# Print the formatted memory usage
print(f"Total DataFrame memory usage: {mem_mb:.10f} MB")

Total DataFrame memory usage: 0.0034046173 MB
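
The total above hides which columns account for the memory. As a minimal sketch, the per-column breakdown below compares `deep=False` (array buffers only) with `deep=True` (which also counts the Python string objects in text columns). The three-column `sample` frame is an assumption built from the first two rows shown earlier, not the full dataset.

```python
import pandas as pd

# Small stand-in frame built from the first two rows of df.head();
# it is only an illustration, not the full 24-row dataset.
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-27", "2022-02-12"]),
        "LST_mean_C": [12.678688, 13.168473],
        "scene": ["January", "Feburary"],  # label spelled as it appears in the data
    }
)

# Shallow usage counts only the underlying array buffers
shallow = sample.memory_usage(deep=False)

# Deep usage also inspects object/string columns for their real footprint
deep = sample.memory_usage(deep=True)

# Side-by-side view, in bytes, one row per column (plus the index)
per_column = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(per_column)
```

For numeric and datetime columns the two figures agree; any difference comes from text columns such as `scene`.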


### Data Understanding

| Column       | Description                                                                                   |
|-------------|-----------------------------------------------------------------------------------------------|
| date        | Date of observation                                                                           |
| NDVI_mean   | Mean Normalized Difference Vegetation Index → measures vegetation greenness (range −1 to 1). Higher = greener/healthier vegetation |
| NDVI_std    | Standard deviation of NDVI → variability in vegetation health                                 |
| NDWI_mean   | Mean Normalized Difference Water Index → measures water presence. Higher = more water content |
| NDWI_std    | Standard deviation of NDWI                                                                    |
| NDBI_mean   | Mean Normalized Difference Built-up Index → measures urban areas. Higher = more built-up/urbanized |
| NDBI_std    | Standard deviation of NDBI                                                                    |
| LST_mean_C  | Mean Land Surface Temperature in Celsius                                                     |
| LST_std_C   | Standard deviation of temperature                                                            |
| count       | Number of pixels used in the computation                                                     |
| year        | Year of observation                                                                          |
| scene       | Month or descriptive name of the scene                                                       |
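
All three spectral indices in the table are normalized differences of two bands. The notebook does not state which band combinations were used, so the sketch below assumes one common pair of definitions: NDWI as (NIR − SWIR) / (NIR + SWIR) and NDBI as (SWIR − NIR) / (SWIR + NIR). Under that assumption NDBI is exactly the negative of NDWI, which is consistent with the sample rows shown earlier, where `NDBI_mean` equals `-NDWI_mean`. The reflectance values `nir` and `swir` are hypothetical.

```python
import pandas as pd


def normalized_difference(a, b):
    """Return (a - b) / (a + b), the generic normalized-difference form."""
    return (a - b) / (a + b)


# Hypothetical band reflectances, for illustration only
nir, swir = 0.30, 0.25
ndwi = normalized_difference(nir, swir)  # assumed NDWI definition (NIR vs SWIR)
ndbi = normalized_difference(swir, nir)  # assumed NDBI definition (SWIR vs NIR)

# Index means copied from the first five rows of the df.head() output
sample = pd.DataFrame(
    {
        "NDWI_mean": [0.055569, 0.056023, 0.039716, 0.059367, 0.107981],
        "NDBI_mean": [-0.055569, -0.056023, -0.039716, -0.059367, -0.107981],
    }
)

# True when every NDBI mean is the negation of the matching NDWI mean
mirrored = (sample["NDWI_mean"] + sample["NDBI_mean"]).abs().max() < 1e-9
print(f"NDBI mirrors NDWI in the sample rows: {mirrored}")
```

If the two columns mirror each other across the whole dataset, they carry the same information, and only one of them is needed in later correlation or modelling steps.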

### Statistical Summary of Numeric Columns

In [6]:
# 📈 Statistical summary of numerical columns
print("Descriptive statistics of the dataset:")
print("-" * 50)
print(df.describe())

Descriptive statistics of the dataset:
--------------------------------------------------
                      date  NDVI_mean   NDVI_std  NDWI_mean   NDWI_std  \
count                   24  24.000000  24.000000  24.000000  24.000000   
mean   2023-06-06 08:00:00   0.229643   0.084712   0.067998   0.070091   
min    2022-01-27 00:00:00   0.179490   0.063529   0.028477   0.062376   
25%    2022-11-05 00:00:00   0.204214   0.076368   0.045106   0.066202   
50%    2023-04-20 00:00:00   0.223381   0.086268   0.058228   0.069401   
75%    2024-02-18 00:00:00   0.246555   0.091078   0.088482   0.072839   
max    2024-12-18 00:00:00   0.303173   0.109813   0.128518   0.082866   
std                    NaN   0.035778   0.011239   0.030161   0.005352   

       NDBI_mean   NDBI_std  LST_mean_C  LST_std_C      count         year  
count  24.000000  24.000000   24.000000  24.000000       24.0    24.000000  
mean   -0.067998   0.070091   22.345648   4.740335  2649889.0  2022.958333  
min    -0.12

In [7]:
# 🔢 Check value counts of the 'scene' column
print("Frequency of each unique value in the 'scene' column:")
print("-" * 50)
scene_counts = df["scene"].value_counts()
print(scene_counts)

Frequency of each unique value in the 'scene' column:
--------------------------------------------------
scene
Feburary    3
October     3
March       3
April       3
May         3
December    3
November    3
January     2
June        1
Name: count, dtype: int64
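
The output above shows two things worth handling before any month-based grouping: the label `Feburary` is misspelled in the data, and `value_counts()` orders by frequency rather than by calendar month. A minimal sketch of one way to address both follows. The `scene` series is reconstructed from the counts printed above, and `month_order` is a helper list introduced here for illustration.

```python
import pandas as pd

# Scene labels reconstructed from the value_counts() output above
counts = {
    "Feburary": 3, "October": 3, "March": 3, "April": 3, "May": 3,
    "December": 3, "November": 3, "January": 2, "June": 1,
}
scene = pd.Series(
    [name for name, n in counts.items() for _ in range(n)], name="scene"
)

# Correct the misspelled month label
scene = scene.replace({"Feburary": "February"})

# Calendar order for the months (helper list, not part of the dataset)
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Count scenes per month in calendar order; months with no scene get 0
scene_counts = scene.value_counts().reindex(month_order, fill_value=0)
print(scene_counts)
```

Reindexing against the full month list also makes the gaps visible: July, August, and September have no scenes in this dataset.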


In [8]:
# 📅 Check the distribution of years in the dataset
print("Number of records per year:")
print("-" * 40)
year_counts = df["year"].value_counts()
print(year_counts)

Number of records per year:
----------------------------------------
year
2023    9
2022    8
2024    7
Name: count, dtype: int64
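
Because `value_counts()` sorts by frequency, the years above appear out of chronological order. A short sketch of the same count sorted by year, using a `year` series reconstructed from the printed counts:

```python
import pandas as pd

# Year values reconstructed from the value_counts() output above
year = pd.Series([2023] * 9 + [2022] * 8 + [2024] * 7, name="year")

# Sort the counts by year (the index) instead of by frequency
year_counts = year.value_counts().sort_index()
print(year_counts)
```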
