# **Required Packages**

In [None]:
import pandas as pd             #package for data analysis
import numpy as np              #package for handling numpy arrays
import seaborn as sns           #package for data visualization
import matplotlib.pyplot as plt #package for data visualization
import re                       #package for handling regular expression

# **Reading the Data**

In [None]:
data=pd.read_csv('DataET0.csv',delimiter=';')

### **Data Display**

In [None]:
data.head()

### **Data Structure**

In [None]:
data.shape

In [None]:
#no. of rows --> 1781 and no. of columns-->19

### **Features in the Data**

In [None]:
data.columns

### **Categorical Data Feature**

In [None]:
data.select_dtypes(object).columns

### **Data Description**

### The dataset contains several meteorological variables recorded on a daily basis. The variables are:

- moy_Temp[°C]: average temperature in degrees Celsius
- max_Temp[°C]: maximum temperature in degrees Celsius
- min_Temp[°C]: minimum temperature in degrees Celsius
- moy_DewPoint[°C]: average dew point temperature in degrees Celsius
- min_DewPoint[°C]: minimum dew point temperature in degrees Celsius
- moy_SolarRadiation[W/m2]: average solar radiation in Watts per square meter
- moy_VPD[kPa]: average vapor pressure deficit in kilopascals
- min_VPD[kPa]: minimum vapor pressure deficit in kilopascals
- moy_RelativeHumidity[%]: average relative humidity as a percentage
- max_RelativeHumidity[%]: maximum relative humidity as a percentage
- min_RelativeHumidity[%]: minimum relative humidity as a percentage
- Somme_Precipitation[mm]: total precipitation in millimeters
- moy_WindSpeed[m/s]: average wind speed in meters per second
- max_WindSpeed[m/s]: maximum wind speed in meters per second
- max_WindSpeedMax[m/s]: maximum wind gust speed in meters per second
- ETP quotidien [mm]: daily evapotranspiration in millimeters

This dataset provides daily meteorological observations of various parameters, which can be useful for a variety of applications related to weather and climate studies. The temperature variables (moy_Temp, max_Temp, min_Temp) provide information on the daily temperature range and average temperature, while the dew point variables (moy_DewPoint, min_DewPoint) give insight into the moisture content of the air. The solar radiation (moy_SolarRadiation) and vapor pressure deficit variables (moy_VPD, min_VPD) provide information on the energy balance and water vapor content of the atmosphere. The relative humidity variables (moy_RelativeHumidity, max_RelativeHumidity, min_RelativeHumidity) offer information on the amount of moisture in the air relative to its maximum capacity at a given temperature, and the precipitation variable (Somme_Precipitation) gives the total amount of precipitation on a given day. The wind speed variables (moy_WindSpeed, max_WindSpeed, max_WindSpeedMax) give an indication of the intensity of the wind, and the evapotranspiration variable (ETP quotidien) is a measure of the amount of water lost from the soil and vegetation through the combined processes of evaporation and transpiration. The year, month, and day variables provide the date of the observation. Overall, this dataset can be useful for various applications in agriculture, hydrology, climate modeling, and other fields that require high-resolution meteorological data.

### **Datatype Anomaly Alert!**
In our dataset, numerical feature values are entered as strings: data that should be of type integer or float is stored with the object data type. We will resolve this data type anomaly later in this code. This is the main reason for selecting this dataset, because some features that should hold a single value contain long strings, even in the target feature.

### **Numerical Data Feature**

In [None]:
data.select_dtypes(np.number).columns

# **Data Preprocessing**
- Exploring Features Data Values and Data Types
- Finding Null Values
- Fill Missing Values
- Data Normalization

### **Exploring Features Data Values and Data Types**

In [None]:
data.dtypes

It is strange to see that our features, which are numerical and should have a numeric data type (either int or float), have object as their data type, which refers to text data. We need to convert them into numerical data types.

### **Checking Unique Values in the Data Feature**

In [None]:
feat_with_object_dtype=data.select_dtypes(object).columns
for feature in feat_with_object_dtype:
    print(f'{feature} : {data[feature].unique()}')

When we checked the unique values in our features, we found some features with irregular values, so we need to normalize them. The features with irregular values are ['max_WindSpeedMax[m/s]','max_WindSpeed[m/s]','ETP quotidien [mm]','moy_WindSpeed[m/s]'].
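As a complement to reading the unique values by eye, irregular entries can also be flagged programmatically. The following is a minimal sketch on a hypothetical sample series (the values are invented for illustration and are not taken from `DataET0.csv`): it swaps the decimal comma for a dot and lets `pd.to_numeric` mark anything unparseable as NaN.

```python
import pandas as pd

# Hypothetical sample mimicking comma-decimal strings with a few malformed entries.
sample = pd.Series(["1,2", "0,8", "3,4,5", "abc", None, "2"])

# Swap the decimal comma for a dot, then coerce anything unparseable to NaN.
parsed = pd.to_numeric(sample.str.replace(",", ".", regex=False), errors="coerce")

# Entries that were present in the input but could not be parsed as numbers.
irregular = sample[parsed.isna() & sample.notna()]
print(irregular.tolist())
```

Entries that are genuinely missing are excluded from the result, so only malformed strings are reported.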

### **Removing Irregular Data Values in the Features**

In [None]:
feat_with_wrong_inputs=['max_WindSpeedMax[m/s]','max_WindSpeed[m/s]','ETP quotidien [mm]','moy_WindSpeed[m/s]']
for feature in feat_with_wrong_inputs:
    # Treat any entry whose string form is longer than 3 characters as malformed and replace it with 0
    data[feature]=data[feature].apply(lambda x: 0 if len(str(x))>3 else x)

In [None]:
data.isnull().sum()/data.shape[0]

### **Missing Values Imputation**

In [None]:
feat_with_miss_val=['moy_WindSpeed[m/s]','max_WindSpeed[m/s]','max_WindSpeedMax[m/s]','ETP quotidien [mm]']
for feature in feat_with_miss_val:
    data[feature]=data[feature].fillna(0)  # assign back; in-place fillna on a selected column is deprecated in recent pandas

### **Dropping Data Features**

If more than 30 percent of a feature's values are null, it is good practice to drop that feature; otherwise it can distort the final output. So we are going to drop two features.

- moy_WindDirection[deg]
- dernier_WindDirection[deg]
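The 30 percent rule can be applied programmatically rather than by listing columns by hand. The sketch below uses a small hypothetical frame (the column names `a`, `b`, `c` are placeholders, not columns of this dataset) to compute the share of missing values per column and drop those above the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column "b" is mostly missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Share of missing values per column (mean of a boolean mask).
null_fraction = df.isnull().mean()

# Columns whose missing share exceeds 30 percent.
to_drop = null_fraction[null_fraction > 0.30].index.tolist()
df_reduced = df.drop(columns=to_drop)
print(to_drop)
```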

In [None]:
data.drop(['moy_WindDirection[deg]','dernier_WindDirection[deg]'],axis=1,inplace=True)

Because our features contain numerical values in the form of strings, we need to convert them to numerical form.

In [None]:
for feature in data.select_dtypes(object).columns[1:]:
    data[feature]=data[feature].apply(lambda x:re.sub(',','.',str(x)))
    data[feature]=pd.to_numeric(data[feature])
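When a file uses a comma as the decimal mark throughout, `pd.read_csv` can also parse the numbers at load time through its `decimal` parameter. The sketch below demonstrates this on a hypothetical in-memory file (the column names and values are invented); columns that also contain malformed entries would still load as object and need the cleaning shown above.

```python
import io
import pandas as pd

# Hypothetical two-column file using ';' as separator and ',' as decimal mark.
raw = "temp;wind\n12,5;1,2\n13,0;0,8\n"

# decimal="," tells the parser to read "12,5" as the float 12.5.
parsed = pd.read_csv(io.StringIO(raw), delimiter=";", decimal=",")
print(parsed.dtypes)
```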

# **Regularized Data Types**

In [None]:
data.dtypes

### **Extracting Day,Month,Year From Date Feature**

In [None]:
data['Date/heure']=data['Date/heure'].apply(lambda x:str(x).split()[0])
data['Date/heure']=pd.to_datetime(data['Date/heure'])
data['year']=data['Date/heure'].dt.year
data['month']=data['Date/heure'].dt.month
data['day']=data['Date/heure'].dt.day
data.drop(['Date/heure'],axis=1,inplace=True)
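The same extraction can be illustrated on a small hypothetical series. The day-first format string used here is an assumption for the example only; the actual format of `Date/heure` should be checked before passing an explicit `format`.

```python
import pandas as pd

# Hypothetical timestamps in day-first order; the real file's format may differ.
stamps = pd.Series(["01/02/2020 00:00", "15/03/2021 00:00"])

# Keep only the date part, then parse with an explicit (assumed) format.
dates = pd.to_datetime(stamps.str.split().str[0], format="%d/%m/%Y")

# Split the parsed dates into separate numeric features.
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
})
print(parts)
```

Passing an explicit format avoids silent day/month swaps when a date such as 01/02/2020 is ambiguous.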

### **Feature Correlation**

In [None]:
plt.figure(figsize=(20,10))
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True)
plt.show()

Correlation: a measure of the strength and direction of the linear relationship between two features. Its value lies between -1 and 1. Values near 1 indicate a strong positive correlation (light boxes), values near -1 indicate a strong negative correlation (dark boxes), and values near 0 indicate little or no linear correlation.
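To make the range of the coefficient concrete, here is a small sketch on synthetic columns (the names `up`, `down`, and `u_shape` are invented for illustration). It also shows that a coefficient near 0 only rules out a linear relationship: the `u_shape` column depends on `x` entirely, yet its Pearson correlation with `x` is zero.

```python
import numpy as np
import pandas as pd

x = np.arange(10, dtype=float)
demo = pd.DataFrame({
    "x": x,
    "up": 2 * x + 1,              # perfect positive linear relation
    "down": -3 * x + 5,           # perfect negative linear relation
    "u_shape": (x - 4.5) ** 2,    # symmetric around the mean of x, no linear trend
})

# Pearson correlation matrix, the same default used by DataFrame.corr() above.
corr = demo.corr()
print(corr.round(2))
```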


### **Normalized Data**

In [None]:
data

# **Data Visualization**

In [None]:
avg_temps=data.groupby(['year']).mean()

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['moy_Temp[°C]'],color='blue')
plt.title('Average moy_Temp[°C] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['max_Temp[°C]'])
plt.title('Average max_Temp[°C] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['min_Temp[°C]'],color='green')
plt.title('Average min_Temp[°C] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['moy_DewPoint[°C]'],color='purple')
plt.title('Average moy_DewPoint[°C] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['min_DewPoint[°C]'])
plt.title('Average min_DewPoint[°C] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['moy_SolarRadiation[W/m2]'],color='red')
plt.title('Average moy_SolarRadiation[W/m2] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['moy_VPD[kPa]'],color='purple')
plt.title('Average moy_VPD[kPa] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['min_VPD[kPa]'])
plt.title('Average min_VPD[kPa] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['moy_RelativeHumidity[%]'],color='violet')
plt.title('Average moy_RelativeHumidity[%] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['max_RelativeHumidity[%]'],color='green')
plt.title('Average max_RelativeHumidity[%] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['min_RelativeHumidity[%]'],color='black')
plt.title('Average min_RelativeHumidity[%] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['Somme_Precipitation[mm]'])
plt.title('Average Somme_Precipitation[mm] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['moy_WindSpeed[m/s]'],color='blue')
plt.title('Average moy_WindSpeed[m/s] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['max_WindSpeed[m/s]'],color='green')
plt.title('Average max_WindSpeed[m/s] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['max_WindSpeedMax[m/s]'])
plt.title('Average max_WindSpeedMax[m/s] over the years')

In [None]:
sns.lineplot(x=avg_temps.index,y=avg_temps['ETP quotidien [mm]'],color='purple')
plt.title('Average ETP quotidien [mm] over the years')

# **Hypothesis Testing**

---
### Let's try to find out the impact of the independent features (X) on the target feature y (ETP quotidien [mm]).
---

# **Null Hypothesis:**
There is no significant impact of independent features (X) on target feature ('ETP quotidien [mm]')

# **Alternate Hypothesis:**
There is significant impact of independent features (X) on target feature ('ETP quotidien [mm]')

# **ETP quotidien [mm]** stands for "Evapotranspiration Potentielle quotidienne in millimeters" in French. 
It is a measure of the amount of water that could potentially be evaporated and transpired by a crop or vegetation in a day, assuming there is sufficient water available. ETP is an important factor in determining crop water requirements and irrigation scheduling. It is typically estimated using empirical equations that take into account factors such as temperature, humidity, wind speed, and solar radiation.

In [None]:
y=data['ETP quotidien [mm]']

In [None]:
columns=data.drop(['ETP quotidien [mm]'],axis=1).columns

In [None]:
in_significant_feat=[]

### **Applying T-Test**

In [None]:
import scipy.stats as stats

for col in columns:
    t_stat, p_value = stats.ttest_ind(data[col],y)
    if p_value < 0.05:
        print("Reject null hypothesis. There is a significant impact of "+col+" on "+"ETP quotidien [mm]")
    else:
        print("Fail to reject null hypothesis. There is no significant impact of "+col+" on "+"ETP quotidien [mm]")
        in_significant_feat.append(col)
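Note that `stats.ttest_ind` compares the means of two samples. A different, commonly used way to test for a linear association between a feature and a target is a correlation test such as `scipy.stats.pearsonr`. The sketch below is not the notebook's method; it runs on synthetic data (all variable names are hypothetical) purely to illustrate how such a test is called and read.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

feature = rng.normal(size=n)
noise = rng.normal(size=n)
target = 2.0 * feature + 0.5 * noise   # target depends linearly on feature
unrelated = rng.normal(size=n)         # generated independently of target

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis of no linear correlation.
r_dep, p_dep = stats.pearsonr(feature, target)
r_ind, p_ind = stats.pearsonr(unrelated, target)

print("dependent feature:   r =", round(r_dep, 3), " significant:", p_dep < 0.05)
print("independent feature: r =", round(r_ind, 3), " significant:", p_ind < 0.05)
```

A small p-value here suggests a linear association between the two variables, which is closer to the "impact on the target" wording of the hypotheses than a comparison of means.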

# **Drop Insignificant Features**

In [None]:
in_significant_feat
# there is no insignificant input feature to drop

# **Separating Dependent and Independent Feature**

In [None]:
X=data.drop(['ETP quotidien [mm]'],axis=1)
y=data['ETP quotidien [mm]']

# **Data Scaling by MinMax Scaler**

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X= scaler.fit_transform(X)
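With its default settings, `MinMaxScaler` maps each column to the range [0, 1] using (x - min) / (max - min). The short sketch below checks this on a tiny hypothetical column by comparing a manual computation with the scaler's output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column input.
values = np.array([[2.0], [4.0], [10.0]])

# Manual min-max scaling, column by column.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())
```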

# **Note!**
Fortunately or unfortunately, there are no categorical features present in our selected data.
This data was selected because it has mistyped data that requires solid problem-solving
and data-handling skills.

# **Model Preparation**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3) #splitting data into training and testing sets
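A common variant of this workflow fits the scaler on the training split only, so that no statistics from the test rows reach the model. The sketch below shows one way to do that with a scikit-learn `Pipeline` on synthetic data; the arrays `X_demo` and `y_demo` and the fixed `random_state` are assumptions made for the example, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split first, on the unscaled features.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on X_tr only, then reuses those
# min/max values to transform X_te at scoring time.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)
print("Test R2:", round(test_r2, 4))
```

Passing a pipeline like this to `cross_val_score` would also refit the scaler inside each fold.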

---
## **Applying Machine Learning Models**
---

In [None]:
# Applying Multiple Models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
#k-fold cross validation
from sklearn.model_selection import KFold, cross_val_score


# Importing evaluation modules
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# check the performance of different regressors
models = []
models.append(('LinearRegression', LinearRegression()))
models.append(('KNeighborsRegressor', KNeighborsRegressor()))
models.append(('SVMRegressor',SVR(C=1.0, epsilon=0.2)))

# prepare the cross-validation procedure
cv = KFold(n_splits=5, random_state=1, shuffle=True)

train_l = []
test_l = []
mae_l = []
rmse_l = []
r2_l = []

import time
i = 0
for name,model in models:
    i = i+1
    start_time = time.time()
    
    # Fitting model to the Training set
    regressor = model
    regressor.fit(X_train, y_train)
    
    # Scores of model
    train = model.score(X_train, y_train)
    test = model.score(X_test, y_test)
    
    train_l.append(train)
    test_l.append(test)
    
    # predict values
    predictions = regressor.predict(X_test)
    # RMSE
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    rmse_l.append(rmse)
    # MAE
    mae = mean_absolute_error(y_test,predictions)
    mae_l.append(mae)
    # R2 score
    r2 = r2_score(y_test,predictions)
    r2_l.append(r2)
    
    # evaluate model
    scores = cross_val_score(regressor, X, y, cv=cv, n_jobs=-1)
    
    


    print("+","="*100,"+")
    print('\033[1m' + f"\t\t\t{i}-For {name} The Performance result is: " + '\033[0m')
    print("+","="*100,"+")
    print('Root mean squared error (RMSE) : ', rmse)   
    print("-"*50)
    print('Mean absolute error (MAE) : ', mae)
    print("-"*50)
    print('R2 score : ', r2)
    print("-"*50)
    print('Avg Cross Validation Score : ', sum(scores)/len(scores))
    print("-"*50)
    print("\t\t\t\t\t\t\t-----------------------------------------------------------")
    print(f"\t\t\t\t\t\t\t Time for detection ({name}) : {round((time.time() - start_time), 3)} seconds...")
    print("\t\t\t\t\t\t\t-----------------------------------------------------------")
    print()
    
comp = pd.DataFrame({"Training Score": train_l, "Testing Score": test_l, "MAE": mae_l, "RMSE": rmse_l, "R2 Score": r2_l})
comp
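For regressors, `cross_val_score` reports R2 by default. If cross-validated MAE and RMSE are also of interest, `cross_validate` accepts several scorers at once. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`); scikit-learn reports error metrics as negated scores, so the signs are flipped when summarizing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(120, 4))
y_demo = X_demo @ np.array([3.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.05, size=120)

# Same fold setup as in the cell above.
cv = KFold(n_splits=5, shuffle=True, random_state=1)

results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=cv,
    scoring=("r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"),
)

# Average each metric over the folds; negate the error scorers.
mean_r2 = results["test_r2"].mean()
mean_mae = -results["test_neg_mean_absolute_error"].mean()
mean_rmse = -results["test_neg_root_mean_squared_error"].mean()
print("R2:", round(mean_r2, 4), "MAE:", round(mean_mae, 4), "RMSE:", round(mean_rmse, 4))
```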