<a href="https://colab.research.google.com/github/KeyvanDiba/2022_ML_Earth_Env_Sci/blob/main/Personal_Project_Kdiba.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning - Personal Project

Project description: This work assesses whether machine learning techniques can predict suspended sediment concentration in a proglacial river as a function of discharge, and whether these two variables are correlated.

## Part I : Connecting to Google Drive and loading the data

The goal here is to connect this notebook to my personal Drive and load the two data files.

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
df_1 =pd.read_excel('/content/LGS1_data.xlsx')
df_2 =pd.read_excel('/content/LGS2_data.xlsx')

In [None]:
df_1

Unnamed: 0,Time,C,SD_C,Q,SD_Q,QC,SD_QC,NoUse,NoUse2
0,244.003,0.9668,0.0068,2.6796,0.2546,2.5907,0.4839,3.0745,2.1068
1,244.007,0.9507,0.0070,2.6995,0.2548,2.5664,0.4763,3.0427,2.0901
2,244.010,0.9451,0.0070,2.6278,0.2541,2.4836,0.4722,2.9558,2.0114
3,244.014,0.9445,0.0071,2.7611,0.2554,2.6078,0.4743,3.0821,2.1335
4,244.017,0.9382,0.0071,2.7123,0.2549,2.5447,0.4703,3.0151,2.0744
...,...,...,...,...,...,...,...,...,...
5284,271.417,0.7465,0.0097,1.3525,0.2523,1.0096,0.3700,1.3796,0.6396
5285,271.424,0.0000,0.0000,1.4786,0.2496,0.0000,,,
5286,271.431,0.0000,0.0000,1.4404,0.2503,0.0000,,,
5287,271.438,0.0000,0.0000,1.3914,0.2513,0.0000,,,


In [None]:
df_2

Unnamed: 0,Time,C,SD_C,Q,SD_Q,QC,SD_QC,NoUse,NoUse2
0,246.5347,1.3922,0.0253,4.2128,0.3126,5.8650,0.8781,6.7431,4.9870
1,246.5361,1.3950,0.0254,4.2856,0.3119,5.9786,0.8791,6.8577,5.0995
2,246.5375,1.4020,0.0258,4.3692,0.3111,6.1257,0.8829,7.0086,5.2428
3,246.5389,1.4070,0.0261,4.4405,0.3104,6.2478,0.8854,7.1332,5.3624
4,246.5403,1.4105,0.0262,4.4719,0.3100,6.3075,0.8874,7.1949,5.4200
...,...,...,...,...,...,...,...,...,...
5204,271.3542,0.0000,0.0000,1.1966,0.2190,0.0000,,,
5205,271.3611,0.0000,0.0000,1.1906,0.2191,0.0000,,,
5206,271.3681,0.0000,0.0000,1.1441,0.2200,0.0000,,,
5207,271.3750,0.0000,0.0000,1.1496,0.2199,0.0000,,,


Information about the different variables:

*   Time  = time, in days since 1.1.21
*   C     = suspended load concentration
*   SD_C  = standard deviation of C
*   Q     = flow rate
*   SD_Q  = standard deviation of Q
*   QC    = suspended load transport, kg/s
*   SD_QC = standard deviation of QC
*   NoUse, NoUse2 = columns not needed for this exercise

Measuring stations:

*   GS1 : located downstream of the alluvial plain
*   GS2 : located at the exit of the glacier, upstream of the alluvial plain
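The displayed values suggest that QC is the product of C and Q. A minimal check, using only the first five LGS1 rows shown in the output above and assuming the columns are rounded to four decimals (the tolerance of 1e-3 is an arbitrary choice):

```python
import numpy as np
import pandas as pd

# First five rows of the LGS1 table, copied from the displayed output
sample = pd.DataFrame({
    "C":  [0.9668, 0.9507, 0.9451, 0.9445, 0.9382],
    "Q":  [2.6796, 2.6995, 2.6278, 2.7611, 2.7123],
    "QC": [2.5907, 2.5664, 2.4836, 2.6078, 2.5447],
})

# Suspended load transport recomputed as concentration times flow rate
qc_recomputed = sample["C"] * sample["Q"]

# Largest absolute difference from the stored QC column
max_abs_diff = float((qc_recomputed - sample["QC"]).abs().max())
matches = bool(np.allclose(qc_recomputed, sample["QC"], atol=1e-3))
print(max_abs_diff, matches)
```

On the full dataset the same comparison could be run on `df_1` and `df_2` directly.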

## Part II : Loading the packages used throughout the project

The goal here is to import, in one place, all the packages used to complete the project.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

## Part III : Data processing

The purpose of this section is to remove the unnecessary columns from the dataframes and to drop the rows that contain NaN values.

In [None]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5289 entries, 0 to 5288
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    5289 non-null   float64
 1   C       5289 non-null   float64
 2   SD_C    5289 non-null   float64
 3   Q       5289 non-null   float64
 4   SD_Q    5289 non-null   float64
 5   QC      5289 non-null   float64
 6   SD_QC   5239 non-null   float64
 7   NoUse   5239 non-null   float64
 8   NoUse2  5239 non-null   float64
dtypes: float64(9)
memory usage: 372.0 KB


In [None]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5209 entries, 0 to 5208
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    5209 non-null   float64
 1   C       5209 non-null   float64
 2   SD_C    5209 non-null   float64
 3   Q       5209 non-null   float64
 4   SD_Q    5209 non-null   float64
 5   QC      5209 non-null   float64
 6   SD_QC   1964 non-null   float64
 7   NoUse   1964 non-null   float64
 8   NoUse2  1964 non-null   float64
dtypes: float64(9)
memory usage: 366.4 KB


In [None]:
# Keep only the useful columns (drops the two "NoUse" columns)
LGS1 = df_1[['Time','C','SD_C','Q','SD_Q','QC','SD_QC']].copy()
LGS2 = df_2[['Time','C','SD_C','Q','SD_Q','QC','SD_QC']].copy()

In [None]:
LGS1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5289 entries, 0 to 5288
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    5289 non-null   float64
 1   C       5289 non-null   float64
 2   SD_C    5289 non-null   float64
 3   Q       5289 non-null   float64
 4   SD_Q    5289 non-null   float64
 5   QC      5289 non-null   float64
 6   SD_QC   5239 non-null   float64
dtypes: float64(7)
memory usage: 289.4 KB


In [None]:
LGS2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5209 entries, 0 to 5208
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    5209 non-null   float64
 1   C       5209 non-null   float64
 2   SD_C    5209 non-null   float64
 3   Q       5209 non-null   float64
 4   SD_Q    5209 non-null   float64
 5   QC      5209 non-null   float64
 6   SD_QC   1964 non-null   float64
dtypes: float64(7)
memory usage: 285.0 KB


In [None]:
# Check whether any row contains NaN values
LGS1.isna().values.any()
LGS2.isna().values.any()

True
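A notebook cell displays only the value of its last expression, so the `True` above refers to LGS2 alone. A small sketch of how both frames could be reported explicitly, with a per-column NaN count; the two toy frames below are hypothetical stand-ins, not the station data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for LGS1 and LGS2 (toy values, not the real data)
frames = {
    "LGS1": pd.DataFrame({"C": [0.9, 0.0], "QC": [2.5, 0.0], "SD_QC": [0.48, np.nan]}),
    "LGS2": pd.DataFrame({"C": [1.4, 1.4], "QC": [5.9, 6.0], "SD_QC": [0.88, 0.88]}),
}

# Report each frame separately so neither result is hidden
nan_counts = {}
for name, frame in frames.items():
    nan_counts[name] = frame.isna().sum()  # NaN count per column
    print(name, "contains NaN:", bool(frame.isna().values.any()))
    print(nan_counts[name])
```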

In [None]:
# If True -> drop every row containing NaN
LGS1_noNAN = LGS1.dropna(axis=0)
LGS2_noNAN = LGS2.dropna(axis=0)

In [None]:
# Check again that no row contains NaN values
LGS1_noNAN.isna().values.any()
LGS2_noNAN.isna().values.any()

False

In [None]:
LGS1_noNAN.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5239 entries, 0 to 5284
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    5239 non-null   float64
 1   C       5239 non-null   float64
 2   SD_C    5239 non-null   float64
 3   Q       5239 non-null   float64
 4   SD_Q    5239 non-null   float64
 5   QC      5239 non-null   float64
 6   SD_QC   5239 non-null   float64
dtypes: float64(7)
memory usage: 327.4 KB


In [None]:
LGS2_noNAN.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1964 entries, 0 to 1963
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    1964 non-null   float64
 1   C       1964 non-null   float64
 2   SD_C    1964 non-null   float64
 3   Q       1964 non-null   float64
 4   SD_Q    1964 non-null   float64
 5   QC      1964 non-null   float64
 6   SD_QC   1964 non-null   float64
dtypes: float64(7)
memory usage: 122.8 KB


## Part IV : Applying a random forest classifier to the LGS2 dataset

The aim of this section is to apply the RandomForest algorithm to the dataset from the measuring station located at the outlet of the glacier (GS2).

In [None]:
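# Target: concentration C; single feature: flow rate Q
# (taken from LGS2 rather than LGS2_noNAN; neither column contains NaN)
Y = LGS2[['C']].copy()
X = LGS2[['Q']].copy()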

In [None]:
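# First split: 3000 training rows and 2200 held-out rows
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=2200, train_size=3000)
# Second split: takes X_train/y_train as input, so the new test and validation
# sets (1500 rows each) are subsets of the training set and overwrite the held-out rows
X_test, X_valid, y_test, y_valid = train_test_split(X_train, y_train, test_size=1500, train_size=1500)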


In [None]:
print(len(X_train), len(X_test), len(X_valid))

3000 1500 1500
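In the cell above, the second call to `train_test_split` receives `X_train` and `y_train`, so the 1500-row test and validation sets are both drawn from the 3000 training rows. A minimal sketch of a split that keeps the three sets disjoint, assuming the 2200 held-out rows are divided evenly; the synthetic data, the `random_state` values and the names `X_hold`/`y_hold` are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as LGS2 (values are made up)
rng = np.random.default_rng(0)
data = pd.DataFrame({"Q": rng.uniform(1.0, 5.0, 5209)})
data["C"] = 0.3 * data["Q"] + rng.normal(0.0, 0.05, 5209)

X, y = data[["Q"]], data["C"]

# First split: 3000 training rows, 2200 rows held out
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, train_size=3000, test_size=2200, random_state=0
)
# Second split: divide the held-out rows (not the training rows) in half
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, train_size=1100, test_size=1100, random_state=0
)

# The three index sets should not share any row
train_idx, valid_idx, test_idx = set(X_train.index), set(X_valid.index), set(X_test.index)
overlap = (train_idx & valid_idx) | (train_idx & test_idx) | (valid_idx & test_idx)
print(len(X_train), len(X_valid), len(X_test), len(overlap))
```

With disjoint sets, the validation score reflects rows the model has not seen during fitting.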


In [None]:
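# Encode every distinct C value as an integer class label.
# Each fit_transform call refits the encoder, so each subset gets its own mapping.
label_encoder = preprocessing.LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.fit_transform(y_test)
y_valid = label_encoder.fit_transform(y_valid)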

  y = column_or_1d(y, warn=True)
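Calling `fit_transform` separately on each subset assigns integer codes by sorted rank within that subset, so the same concentration can receive different codes in the training and validation sets. A toy illustration with made-up values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy concentration values (made up for illustration)
c_train = np.array([0.10, 0.20, 0.30])
c_valid = np.array([0.20, 0.30, 0.40])

# A fresh encoder per subset: codes follow the sorted order of that subset only
codes_train = LabelEncoder().fit_transform(c_train)
codes_valid = LabelEncoder().fit_transform(c_valid)

# Look up the code given to the value 0.20 in each subset
code_in_train = int(codes_train[c_train == 0.20][0])
code_in_valid = int(codes_valid[c_valid == 0.20][0])
print(code_in_train, code_in_valid)  # prints "1 0"
```

Fitting the encoder once and reusing it with `transform` keeps the codes consistent, although `transform` raises an error for values it did not see during fitting.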


In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

RandomForestClassifier()

In [None]:
rfc_valid = rfc.predict(X_valid)
print(accuracy_score(y_valid, rfc_valid))

0.6286666666666667


In [None]:
print('\nConfusion matrix for RandomForestClassifier:')
print(confusion_matrix(y_valid, rfc_valid))

acc_rfc = accuracy_score(y_valid, rfc_valid)
print(f'\nThe accuracy of the model RandomForestClassifier is {acc_rfc:.1%}')


Confusion matrix for RandomForestClassifier:
[[939   0   0 ...   0   0   0]
 [  0   1   0 ...   0   0   0]
 [  0   0   1 ...   0   0   0]
 ...
 [  0   0   0 ...   0   0   0]
 [  0   0   0 ...   0   0   0]
 [  0   0   0 ...   0   0   0]]

The accuracy of the model RandomForestClassifier is 62.9%
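C is a continuous quantity, so label encoding turns every distinct concentration value into its own class, and accuracy then counts exact matches only. One possible alternative, sketched here on synthetic data rather than on the station records, is to treat the task as regression with `RandomForestRegressor` and score it with R². The linear relation between Q and C and the noise level below are made-up assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic flow rate and concentration (illustrative values only)
rng = np.random.default_rng(0)
Q = rng.uniform(1.0, 5.0, size=(2000, 1))
C = 0.3 * Q[:, 0] + rng.normal(0.0, 0.05, size=2000)

# Hold out 25% of the rows for validation
Q_train, Q_valid, C_train, C_valid = train_test_split(
    Q, C, test_size=0.25, random_state=0
)

# Fit a regression forest on the continuous target, no label encoding needed
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(Q_train, C_train)

# Coefficient of determination on the unseen validation rows
r2 = r2_score(C_valid, rfr.predict(Q_valid))
print(f"R^2 on the validation set: {r2:.3f}")
```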


## Part V : Applying a random forest classifier to the LGS1 dataset

The aim of this section is to apply the RandomForest algorithm to the dataset from the measuring station located downstream of the alluvial plain (GS1).

In [None]:
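# Target: concentration C; single feature: flow rate Q
# (taken from LGS1 rather than LGS1_noNAN; neither column contains NaN)
Y = LGS1[['C']].copy()
X = LGS1[['Q']].copy()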

In [None]:
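# First split: 3000 training rows and 2200 held-out rows
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=2200, train_size=3000)
# Second split: takes X_train/y_train as input, so the new test and validation
# sets (1500 rows each) are subsets of the training set and overwrite the held-out rows
X_test, X_valid, y_test, y_valid = train_test_split(X_train, y_train, test_size=1500, train_size=1500)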

In [None]:
print(len(X_train), len(X_test), len(X_valid))

3000 1500 1500


In [None]:
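# Encode every distinct C value as an integer class label.
# Each fit_transform call refits the encoder, so each subset gets its own mapping.
label_encoder = preprocessing.LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.fit_transform(y_test)
y_valid = label_encoder.fit_transform(y_valid)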

  y = column_or_1d(y, warn=True)


In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

RandomForestClassifier()

In [None]:
rfc_valid = rfc.predict(X_valid)
print(accuracy_score(y_valid, rfc_valid))

0.014


In [None]:
print('\nConfusion matrix for RandomForestClassifier:')
print(confusion_matrix(y_valid, rfc_valid))

acc_rfc = accuracy_score(y_valid, rfc_valid)
print(f'\nThe accuracy of the model RandomForestClassifier is {acc_rfc:.1%}')


Confusion matrix for RandomForestClassifier:
[[19  0  0 ...  0  0  0]
 [ 0  1  0 ...  0  0  0]
 [ 0  0  1 ...  0  0  0]
 ...
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]

The accuracy of the model RandomForestClassifier is 1.4%
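
The project description also asks whether C and Q are correlated. A minimal sketch of how this could be measured directly, using Pearson's coefficient and a rank-based (Spearman) coefficient; the toy data below is synthetic, and on the real data the `Q` and `C` columns of `LGS1` or `LGS2` would take its place:

```python
import numpy as np
import pandas as pd

# Synthetic flow rate and concentration (illustrative values only)
rng = np.random.default_rng(1)
q = rng.uniform(1.0, 5.0, 500)
toy = pd.DataFrame({"Q": q, "C": 0.3 * q + rng.normal(0.0, 0.1, 500)})

# Pearson correlation: strength of the linear relationship
pearson_r = float(toy["Q"].corr(toy["C"]))

# Spearman correlation: Pearson correlation computed on the ranks,
# which captures any monotonic relationship
spearman_r = float(toy["Q"].rank().corr(toy["C"].rank()))

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```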
