In this notebook, we start the **data regathering process**.  
Because the **previous modeling round** exposed some challenges, we **focus first on 3 governorates only**:  

- **Dakahlia**  
- **Fayoum**  
- **Matrouh**  

The goal of this step is to **rebuild and organize the dataset** for these governorates before extending the process to all 5 governorates.  

The regathering process aims to make the **data consistent, clean, and well-structured**, so that the **preprocessing pipeline** and the **model training** can later be applied more reliably.  


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import os, glob
import pandas as pd

In [None]:
# Point to the Drive folder where the data was uploaded
data_path = '/content/drive/MyDrive/grad_project_data'
files = os.listdir(data_path)
print(len(files), "files found")

12 files found


In [None]:
govs = ['Dakahlia', 'Fayoum', 'Matrouh']

In [None]:
all_data = pd.DataFrame()

In [None]:
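# Read every CSV for each governorate, then concatenate once at the end
# (cheaper than growing all_data inside the loop)
frames = []
for gov in govs:
    gov_path = os.path.join(data_path, gov)

    csv_files = glob.glob(os.path.join(gov_path, '*.csv'))
    print(f"{gov}: {len(csv_files)} files")

    for file in csv_files:
        frames.append(pd.read_csv(file))

all_data = pd.concat(frames, ignore_index=True)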

Dakahlia: 36 files
Fayoum: 36 files
Matrouh: 36 files
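
Concatenating 108 files only works cleanly if every file has the same header; otherwise `pd.concat` silently fills the missing columns with `NaN`. A minimal sketch of a header check that could run before the loop is shown below. The helper name `find_schema_mismatches` is hypothetical and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def find_schema_mismatches(csv_paths):
    """Return {path: columns} for files whose header differs from the first file's.

    Hypothetical helper: the first file (in sorted order) is taken as the reference.
    """
    reference = None
    mismatches = {}
    for path in sorted(csv_paths):
        # nrows=0 parses only the header row, so the check stays cheap
        columns = list(pd.read_csv(path, nrows=0).columns)
        if reference is None:
            reference = columns
        elif columns != reference:
            mismatches[path] = columns
    return mismatches
```

In this notebook it could be called once per governorate, for example `find_schema_mismatches(glob.glob(os.path.join(data_path, gov, '*.csv')))`; an empty dictionary means all headers agree.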


In [None]:
print("Final shape:", all_data.shape)
all_data.head()

Final shape: (146136, 20)


,longitude,latitude,year,month,area,ndvi,t2m_c,td2m_c,rh_pct,tp_m,ssrd_jm2,LC_Type1,sand,silt,clay,soc,ph,bdod,cec,POP
0,31.158066,30.2867,2025,1,Dakahlia,0.70215,15.561177,7.861838,60.125874,0.015587,5885828000.0,40,424,298,279,369,72,134,195,3.509249
1,31.11315,30.610094,2025,1,Dakahlia,0.51,15.568003,8.534099,62.907192,0.034528,5727603000.0,50,385,279,337,334,75,129,185,10.245003
2,31.31078,30.295683,2025,1,Dakahlia,0.59845,15.485617,7.735447,59.899883,0.014093,5903953000.0,50,426,255,319,298,74,131,186,157.432
3,31.25688,30.43043,2025,1,Dakahlia,0.57405,15.449718,7.959476,60.96107,0.022374,5851018000.0,40,399,270,330,372,74,133,187,22.209887
4,31.445526,30.565178,2025,1,Dakahlia,0.5356,15.528049,8.544456,63.112797,0.024161,5755267000.0,50,327,284,390,331,74,133,190,161.65343


In [None]:
all_data['area'].value_counts()

area,count
Dakahlia,100616
Fayoum,40140
Matrouh,5380
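
The counts above are strongly imbalanced: Dakahlia contributes about 69% of the rows and Matrouh under 4%. A small sketch that turns the counts into percentages is shown below; the numbers are copied from the output above rather than recomputed from `all_data`, so that the block runs on its own.

```python
import pandas as pd

# Row counts copied from the value_counts() output above
counts = pd.Series(
    {"Dakahlia": 100616, "Fayoum": 40140, "Matrouh": 5380}, name="count"
)

# Share of each governorate as a percentage of all rows
shares = (counts / counts.sum() * 100).round(2)

print("Total rows:", counts.sum())
print(shares)
```

On the real frame, `all_data['area'].value_counts(normalize=True)` gives the same shares as fractions. The total, 146136, matches the first entry of the shape printed earlier.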


In [None]:
# Drop 2025 rows from August to December (months 8-12); keep the rest of 2025 and all 2023 & 2024 data
filtered_df = all_data[~((all_data['year'] == 2025) & (all_data['month'].between(8, 12)))]

In [None]:
filtered_df.shape

(125704, 20)
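
The filter removed 146136 - 125704 = 20432 rows. To make the mask logic easier to check, here is a minimal sketch on a toy frame (the toy values are made up for illustration). `Series.between` is inclusive on both ends by default, so `between(8, 12)` covers five months, 8 through 12.

```python
import pandas as pd

# Toy data: made-up year/month pairs, not taken from the real dataset
toy = pd.DataFrame({
    "year":  [2023, 2024, 2025, 2025, 2025, 2025],
    "month": [9,    12,   7,    8,    10,   12],
})

# Rows to drop: year 2025 AND month in 8..12 (both bounds included)
drop_mask = (toy["year"] == 2025) & toy["month"].between(8, 12)
kept = toy[~drop_mask]

print("dropped:", int(drop_mask.sum()), "kept:", len(kept))
print(kept)
```

Only the 2025 rows with months 8, 10, and 12 are dropped; the 2023 and 2024 rows survive even though their months fall in the same range.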

In [None]:
filtered_df.duplicated().sum()

np.int64(21341)

In [None]:
des_df = filtered_df.drop_duplicates()
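
With the default `keep='first'`, `duplicated()` and `drop_duplicates()` agree with each other, so `des_df` should end up with 125704 - 21341 = 104363 rows. The sketch below illustrates that relationship on a toy frame and contrasts exact duplicates with duplicates on a key. The choice of `key_cols` is an assumption made for illustration; the notebook itself only removes exact duplicates.

```python
import pandas as pd

# Toy data: made-up values, not taken from the real dataset
toy = pd.DataFrame({
    "longitude": [31.1, 31.1, 31.1, 30.8],
    "latitude":  [30.2, 30.2, 30.2, 29.3],
    "year":      [2024, 2024, 2024, 2024],
    "month":     [1, 1, 1, 1],
    "ndvi":      [0.5, 0.5, 0.6, 0.4],
})

# Exact duplicates: every column must match an earlier row
n_exact = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

# Key-based duplicates: same point and month, other columns may differ
# (key_cols is a hypothetical choice, not something the notebook defines)
key_cols = ["longitude", "latitude", "year", "month"]
n_key = int(toy.duplicated(subset=key_cols).sum())

print("exact duplicates:", n_exact)
print("rows after drop_duplicates:", len(deduped))
print("duplicates on key columns:", n_key)
```

If the key-based count is higher than the exact count on the real data, some points have conflicting measurements for the same month, which exact de-duplication would not catch.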

In [None]:
# Save the filtered, de-duplicated dataset
des_df.to_csv('/content/drive/MyDrive/grad_project_data/des3_df.csv', index=False)