## Explanation 1 - import and clean the data

This document and the next one explain how to import and clean the data and how to perform distribution fitting with Python.

The *Colorado River case study* is taken as a reference. 

This first document focuses on importing and cleaning the data.

To start, we import the libraries we will need.

In [1]:
import datetime

import numpy as np
import pandas as pd

Once the libraries are imported, we need to load the file. For convenience, we store it in a pandas DataFrame.

In [2]:
# df = pd.read_csv('Colorado river.txt')

Try uncommenting the code and then run it.

As you can see, `pd.read_csv` needs some extra arguments, otherwise we will not be able to load the file.

If you look at the *.txt* file, you can see that we can skip the first 49 rows, as they don't contain any data. That is what the `skiprows` argument does. At the same time, we can use the first remaining row (i.e., index 0) as the `header`, which gives each column its name.
We also need to define the delimiter between columns through the `sep` argument, which in this case is a tab.

In [3]:
df = pd.read_csv('Colorado river.txt', skiprows = 49, header = 0,  sep='\t')
df



Unnamed: 0,agency_cd,site_no,datetime,142739_00010_00003,142739_00010_00003_cd,142740_00010_00011,142740_00010_00011_cd,142741_00060_00003,142741_00060_00003_cd,142743_00095_00003,...,308125_00300_00002,308125_00300_00002_cd,314970_63680_00002,314970_63680_00002_cd,314971_63680_00003,314971_63680_00003_cd,314972_63680_00001,314972_63680_00001_cd,322444_00045_00006,322444_00045_00006_cd
0,5s,15s,20d,14n,10s,14n,10s,14n,10s,14n,...,14n,10s,14n,10s,14n,10s,14n,10s,14n,10s
1,USGS,09180500,1949-05-01,,,11.1,A,24200,A,,...,,,,,,,,,,
2,USGS,09180500,1949-05-02,,,12.8,A,21200,A,,...,,,,,,,,,,
3,USGS,09180500,1949-05-03,,,14.4,A,19700,A,,...,,,,,,,,,,
4,USGS,09180500,1949-05-04,,,13.3,A,23300,A,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27040,USGS,9180500,2023-05-12,11.4,P,,,30300,P,341.0,...,9.7,P,762.0,P,843.0,P,956.0,P,0.0,P
27041,USGS,9180500,2023-05-13,11.7,P,,,30600,P,339.0,...,9.6,P,690.0,P,763.0,P,857.0,P,0.0,P
27042,USGS,9180500,2023-05-14,11.9,P,,,32300,P,335.0,...,9.7,P,698.0,P,754.0,P,869.0,P,0.0,P
27043,USGS,9180500,2023-05-15,12.2,P,,,33700,P,329.0,...,9.6,P,648.0,P,712.0,P,784.0,P,0.0,P
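
The effect of `skiprows`, `header`, and `sep` can be tried on a small in-memory file. The snippet below is a minimal sketch: the text in `raw` is a made-up miniature of a tab-separated file (two leading lines to skip, a header row, a format row, and two data rows), not the real *Colorado river.txt*, and the column name `flow` is hypothetical.

```python
import io

import pandas as pd

# Made-up miniature of a tab-separated file (NOT the real data file):
# two leading lines without data, a header row, a format row, two data rows.
raw = (
    "some descriptive text\n"
    "more descriptive text\n"
    "agency_cd\tsite_no\tdatetime\tflow\n"
    "5s\t15s\t20d\t14n\n"
    "USGS\t09180500\t1949-05-01\t24200\n"
    "USGS\t09180500\t1949-05-02\t21200\n"
)

# skiprows=2 ignores the first two lines, header=0 takes the first remaining
# line as column names, and sep="\t" splits each line on tab characters.
demo = pd.read_csv(io.StringIO(raw), skiprows=2, header=0, sep="\t")

print(demo.columns.tolist())
print(demo.shape)
```

In this sketch the format row is still present as row 0, which is why the `flow` column is read as text rather than as numbers.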


We can also drop the first row (i.e., row 0), which contains format descriptors (such as `5s` and `14n`) rather than data.

With `inplace = True`, the row (or column) is deleted directly from the original DataFrame instead of a modified copy being returned.

In [4]:
df.drop([0], axis = 0, inplace = True)
df

Unnamed: 0,agency_cd,site_no,datetime,142739_00010_00003,142739_00010_00003_cd,142740_00010_00011,142740_00010_00011_cd,142741_00060_00003,142741_00060_00003_cd,142743_00095_00003,...,308125_00300_00002,308125_00300_00002_cd,314970_63680_00002,314970_63680_00002_cd,314971_63680_00003,314971_63680_00003_cd,314972_63680_00001,314972_63680_00001_cd,322444_00045_00006,322444_00045_00006_cd
1,USGS,09180500,1949-05-01,,,11.1,A,24200,A,,...,,,,,,,,,,
2,USGS,09180500,1949-05-02,,,12.8,A,21200,A,,...,,,,,,,,,,
3,USGS,09180500,1949-05-03,,,14.4,A,19700,A,,...,,,,,,,,,,
4,USGS,09180500,1949-05-04,,,13.3,A,23300,A,,...,,,,,,,,,,
5,USGS,09180500,1949-05-05,,,12.2,A,24600,A,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27040,USGS,9180500,2023-05-12,11.4,P,,,30300,P,341.0,...,9.7,P,762.0,P,843.0,P,956.0,P,0.0,P
27041,USGS,9180500,2023-05-13,11.7,P,,,30600,P,339.0,...,9.6,P,690.0,P,763.0,P,857.0,P,0.0,P
27042,USGS,9180500,2023-05-14,11.9,P,,,32300,P,335.0,...,9.7,P,698.0,P,754.0,P,869.0,P,0.0,P
27043,USGS,9180500,2023-05-15,12.2,P,,,33700,P,329.0,...,9.6,P,648.0,P,712.0,P,784.0,P,0.0,P
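
The difference between dropping with and without `inplace = True` can be sketched on a toy DataFrame. The name `demo` and its values are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"value": [10, 20, 30]})

# Without inplace, drop() returns a new DataFrame and leaves demo untouched.
trimmed = demo.drop([0], axis=0)
print(len(demo), len(trimmed))

# With inplace=True, drop() returns None and modifies demo itself.
result = demo.drop([0], axis=0, inplace=True)
print(result, demo.index.tolist())
```

Note that the remaining rows keep their original labels, which is why the DataFrame above starts at row 1 after the drop.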


As you can see, the data have now been imported and stored in a proper DataFrame.

However, we don't need all columns for our analysis, so we can drop some of them.

There are two ways of dropping an unnecessary column:
1) by specifying the name of the column;
2) by specifying the position of the column in the DataFrame.

Remember that `axis = 0` is for rows, while `axis = 1` is for columns.
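
As a quick illustration of the two axes, here is a minimal sketch on a toy DataFrame (the names `demo`, `a`, and `b` are made up). It also shows the `columns=` keyword of `drop`, which is an equivalent way of writing `axis = 1`.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[10, 11])

no_row = demo.drop([10], axis=0)         # drops the row labelled 10
no_col = demo.drop(["b"], axis=1)        # drops the column labelled "b"
same_no_col = demo.drop(columns=["b"])   # keyword form, same effect as axis=1

print(no_row.index.tolist())
print(no_col.columns.tolist())
```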

1. Specify the name: we will drop the two columns *agency_cd* and *site_no*.

In [5]:
df.drop(['agency_cd', 'site_no'], axis = 1, inplace = True)
df

Unnamed: 0,datetime,142739_00010_00003,142739_00010_00003_cd,142740_00010_00011,142740_00010_00011_cd,142741_00060_00003,142741_00060_00003_cd,142743_00095_00003,142743_00095_00003_cd,142744_80154_00003,...,308125_00300_00002,308125_00300_00002_cd,314970_63680_00002,314970_63680_00002_cd,314971_63680_00003,314971_63680_00003_cd,314972_63680_00001,314972_63680_00001_cd,322444_00045_00006,322444_00045_00006_cd
1,1949-05-01,,,11.1,A,24200,A,,,2200,...,,,,,,,,,,
2,1949-05-02,,,12.8,A,21200,A,,,900,...,,,,,,,,,,
3,1949-05-03,,,14.4,A,19700,A,,,500,...,,,,,,,,,,
4,1949-05-04,,,13.3,A,23300,A,,,3200,...,,,,,,,,,,
5,1949-05-05,,,12.2,A,24600,A,,,8100,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27040,2023-05-12,11.4,P,,,30300,P,341.0,P,,...,9.7,P,762.0,P,843.0,P,956.0,P,0.0,P
27041,2023-05-13,11.7,P,,,30600,P,339.0,P,,...,9.6,P,690.0,P,763.0,P,857.0,P,0.0,P
27042,2023-05-14,11.9,P,,,32300,P,335.0,P,,...,9.7,P,698.0,P,754.0,P,869.0,P,0.0,P
27043,2023-05-15,12.2,P,,,33700,P,329.0,P,,...,9.6,P,648.0,P,712.0,P,784.0,P,0.0,P


2. Specify the position: we are not interested in the 13th column onwards for our analysis. Because Python counts from 0, these are the columns `df.columns[12:]`.


In [6]:
df.drop(df.columns[12:], axis = 1, inplace = True)
df

Unnamed: 0,datetime,142739_00010_00003,142739_00010_00003_cd,142740_00010_00011,142740_00010_00011_cd,142741_00060_00003,142741_00060_00003_cd,142743_00095_00003,142743_00095_00003_cd,142744_80154_00003,142744_80154_00003_cd,142745_80155_00003
1,1949-05-01,,,11.1,A,24200,A,,,2200,A,144000
2,1949-05-02,,,12.8,A,21200,A,,,900,A,51500
3,1949-05-03,,,14.4,A,19700,A,,,500,A,26600
4,1949-05-04,,,13.3,A,23300,A,,,3200,A,201000
5,1949-05-05,,,12.2,A,24600,A,,,8100,A,547000
...,...,...,...,...,...,...,...,...,...,...,...,...
27040,2023-05-12,11.4,P,,,30300,P,341.0,P,,,
27041,2023-05-13,11.7,P,,,30600,P,339.0,P,,,
27042,2023-05-14,11.9,P,,,32300,P,335.0,P,,,
27043,2023-05-15,12.2,P,,,33700,P,329.0,P,,,


We keep dropping unnecessary columns.

In [7]:
df.drop(df.columns[1:3], axis = 1, inplace = True)
df

Unnamed: 0,datetime,142740_00010_00011,142740_00010_00011_cd,142741_00060_00003,142741_00060_00003_cd,142743_00095_00003,142743_00095_00003_cd,142744_80154_00003,142744_80154_00003_cd,142745_80155_00003
1,1949-05-01,11.1,A,24200,A,,,2200,A,144000
2,1949-05-02,12.8,A,21200,A,,,900,A,51500
3,1949-05-03,14.4,A,19700,A,,,500,A,26600
4,1949-05-04,13.3,A,23300,A,,,3200,A,201000
5,1949-05-05,12.2,A,24600,A,,,8100,A,547000
...,...,...,...,...,...,...,...,...,...,...
27040,2023-05-12,,,30300,P,341.0,P,,,
27041,2023-05-13,,,30600,P,339.0,P,,,
27042,2023-05-14,,,32300,P,335.0,P,,,
27043,2023-05-15,,,33700,P,329.0,P,,,


In [8]:
df.drop(df.columns[1:3], axis = 1, inplace = True)
df

Unnamed: 0,datetime,142741_00060_00003,142741_00060_00003_cd,142743_00095_00003,142743_00095_00003_cd,142744_80154_00003,142744_80154_00003_cd,142745_80155_00003
1,1949-05-01,24200,A,,,2200,A,144000
2,1949-05-02,21200,A,,,900,A,51500
3,1949-05-03,19700,A,,,500,A,26600
4,1949-05-04,23300,A,,,3200,A,201000
5,1949-05-05,24600,A,,,8100,A,547000
...,...,...,...,...,...,...,...,...
27040,2023-05-12,30300,P,341.0,P,,,
27041,2023-05-13,30600,P,339.0,P,,,
27042,2023-05-14,32300,P,335.0,P,,,
27043,2023-05-15,33700,P,329.0,P,,,


The line `df.drop(df.columns[1:3], ...)` is run twice because, once those columns are dropped, their place is "taken" by the next ones: every column to their right shifts left by the number of dropped columns.
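
The shifting can be made visible with a minimal sketch on a toy DataFrame. The short column names are hypothetical stand-ins for the long codes used in the real file.

```python
import pandas as pd

# Hypothetical column names standing in for the long codes of the real file.
demo = pd.DataFrame([[0, 1, 2, 3, 4]], columns=["date", "a", "a_cd", "b", "b_cd"])

first = demo.columns[1:3].tolist()
demo.drop(demo.columns[1:3], axis=1, inplace=True)

# The same slice now points at the columns that moved into positions 1 and 2.
second = demo.columns[1:3].tolist()
demo.drop(demo.columns[1:3], axis=1, inplace=True)

print(first, second, demo.columns.tolist())
```

A single call with a wider slice, such as `demo.columns[1:5]` on the original toy DataFrame, would remove the same four columns at once.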

In [9]:
df.drop(df.columns[2:7], axis = 1, inplace = True)
df

Unnamed: 0,datetime,142741_00060_00003,142745_80155_00003
1,1949-05-01,24200,144000
2,1949-05-02,21200,51500
3,1949-05-03,19700,26600
4,1949-05-04,23300,201000
5,1949-05-05,24600,547000
...,...,...,...
27040,2023-05-12,30300,
27041,2023-05-13,30600,
27042,2023-05-14,32300,
27043,2023-05-15,33700,


Now we have deleted all unnecessary columns and are left with only the three we need (date of record, water discharge, and suspended sediment discharge, respectively).
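
Dropping is not the only way to reach this result. As an alternative sketch (not the approach used in this document), the columns to keep can be selected by name in one step. The toy DataFrame and its short column names are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "datetime": ["1949-05-01", "1949-05-02"],
    "flow": [24200, 21200],
    "flow_cd": ["A", "A"],
    "sediment": [144000, 51500],
})

# Select the wanted columns by name; copy() makes the result independent of demo.
keep = ["datetime", "flow", "sediment"]
subset = demo[keep].copy()

print(subset.columns.tolist())
```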

To make the columns easier to reference, we can rename them.

In [10]:
df.rename(columns = {'datetime': 'Date', '142741_00060_00003': r'Discharge $(m^{3} s^{-1})$',
                     '142745_80155_00003': r'Suspended sediment discharge $(ton day^{-1})$'}, inplace = True)

df

Unnamed: 0,Date,Discharge $(m^{3} s^{-1})$,Suspended sediment discharge $(ton day^{-1})$
1,1949-05-01,24200,144000
2,1949-05-02,21200,51500
3,1949-05-03,19700,26600
4,1949-05-04,23300,201000
5,1949-05-05,24600,547000
...,...,...,...
27040,2023-05-12,30300,
27041,2023-05-13,30600,
27042,2023-05-14,32300,
27043,2023-05-15,33700,


We create an array where we store the names of the columns because we will need to reference them later on.

In [11]:
col = df.columns.values

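We can also change the data type of the columns. For instance, we want the first column (*Date*) to be in datetime format, and we cast the second and third columns to integer and float, respectively. The third column has to be float because it contains missing values, which an integer column cannot hold. After multiplication by the conversion factors, both columns end up as floats.

In the same lines we convert the water and suspended sediment discharge records from US units to metric units: water discharge from cubic feet per second to cubic metres per second (factor 0.0283168), and suspended sediment discharge from short tons per day to metric tons per day (factor 0.91, a rounded value of about 0.907).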

In [12]:
df[col[1]] = df[col[1]].astype('int') * 0.0283168
df[col[2]] = df[col[2]].astype('float') * 0.91
df['Date'] = pd.to_datetime(df['Date'])
df

Unnamed: 0,Date,Discharge $(m^{3} s^{-1})$,Suspended sediment discharge $(ton day^{-1})$
1,1949-05-01,685.26656,131040.0
2,1949-05-02,600.31616,46865.0
3,1949-05-03,557.84096,24206.0
4,1949-05-04,659.78144,182910.0
5,1949-05-05,696.59328,497770.0
...,...,...,...
27040,2023-05-12,857.99904,
27041,2023-05-13,866.49408,
27042,2023-05-14,914.63264,
27043,2023-05-15,954.27616,
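
As a check on the conversion, the first record of the table can be reproduced by hand. This is a minimal sketch: the constant names `CFS_TO_CMS` and `TON_FACTOR` are made up here, and the values are entered as text on the assumption that the original columns were read as strings.

```python
import pandas as pd

CFS_TO_CMS = 0.0283168   # cubic feet per second -> cubic metres per second
TON_FACTOR = 0.91        # factor applied to the sediment column in this document

# First record of the table above, in the original units and stored as text.
flow = pd.Series(["24200"]).astype("int") * CFS_TO_CMS
sediment = pd.Series(["144000"]).astype("float") * TON_FACTOR

# An integer column multiplied by a float factor becomes a float column.
print(flow.dtype, round(flow.iloc[0], 5))
print(sediment.dtype, round(sediment.iloc[0], 1))
```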


Now the DataFrame is ready to use!

We can also store each column in its own array.

Let's start with `numpy.array`.

In [13]:
date_numpy = np.array(df[col[0]])

You can also choose a different type of container, such as a Python list.

In [14]:
date_list = df[col[0]].tolist()  # one list element per date

You can print the `type` of both *date* arrays to check their data type.

Can you think of any other data type? 

In [15]:
print(type(date_numpy))
print(type(date_list))

<class 'numpy.ndarray'>
<class 'list'>
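
Checking the `type` alone does not show how many elements a container holds. The sketch below, on a made-up three-element Series, compares `to_numpy()`, `tolist()`, and wrapping the Series in square brackets. The last one also produces a list, but with a single element (the Series itself).

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(["1949-05-01", "1949-05-02", "1949-05-03"]))

as_array = dates.to_numpy()   # NumPy array with one entry per date
as_list = dates.tolist()      # Python list with one entry per date
wrapped = [dates]             # Python list holding the whole Series as one entry

print(type(as_array), as_array.shape)
print(type(as_list), len(as_list))
print(type(wrapped), len(wrapped))
```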


We can repeat the same procedure for the water discharge (this time using only `numpy.array`).

In [16]:
water_discharge = np.array(df[col[1]])

_Exercise_

Store the suspended sediment discharge column of the DataFrame in a single array.

In [17]:
# use this code space for the exercise