# Getting started

- Colab notebooks consist of text cells (like this one) and code cells, like the one shown below. To execute a code cell, press Cmd+Enter (or Ctrl+Enter). You can also hover the mouse over the `[ ]` symbol in the upper left corner of the code cell; it will turn into a "play" button, and clicking that button executes the cell. You can find other options for executing groups of cells in the "Runtime" menu above.
- Start by executing the code cell below (the one that begins with the line `import pandas as pd`).  This loads ("imports") the required software modules that will be used in the project.



## Basics of Python
Like other programming languages, Python includes variables and functions.

- **Variable**: a reserved memory location to store values.
Put simply, it is like a container that holds data, and its contents can be changed later in the program. For example, to create a variable named `number` and assign it the value `100`:

```
number = 100
```

This variable can be modified at any time.
```
number = 100
number = 1
```

The value of `number` has changed to 1.

- **Function**: a block of code that runs only when it is called. Functions are defined using the `def` keyword. Functions can take user-provided input values, called **arguments**.

For example, let's define an `absolute_value` function as below, which takes one argument, the number for which the absolute value should be calculated.
```
def absolute_value(num):
    if num >= 0:
        return num
    else:
        return -num
```
The output of `absolute_value(2)` is 2, and `absolute_value(-4)` is 4.
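
As a quick check of the function above, the sketch below repeats the definition of `absolute_value` so that it runs on its own, then calls it with a few inputs. The test values are arbitrary, and the comparison with Python's built-in `abs` function is added here only for illustration.

```python
def absolute_value(num):
    """Return the absolute value of num."""
    if num >= 0:
        return num
    else:
        return -num

# call the function with a positive, a negative, and a zero argument
for value in [2, -4, 0]:
    print(value, "->", absolute_value(value))

# the built-in abs() function should agree with our version
results = [absolute_value(v) == abs(v) for v in [2, -4, 0]]
print(all(results))
```

Calling a function means writing its name followed by the argument in parentheses; the value after `return` is what the call gives back.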

## Loading Python Libraries

In [None]:
%%capture
import pandas as pd
import numpy as np

# for normalization
from sklearn import preprocessing

# for visualization
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# for Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# for data imbalance, SMOTE
from imblearn.over_sampling import SMOTE
from scipy import stats

# to calculate the performance of the models
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score

Take a moment to look at this code block:
- `import` loads a module
- `import ... as` allows you to assign a short alias to the module
- `from ... import` loads only a specific part of a module (for example, a single function or class)
- Observe that the `import`, `as` and `from` keywords are color coded purple.
- `#` indicates a comment (observe that all of the text following the `#` is color coded green). This text is not interpreted by the computer; its purpose is to tell a human reader what the code is doing.
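
The three import forms listed above can be compared side by side. The sketch below uses Python's built-in `math` module, which is not part of this project and is chosen only because it needs no installation; the variable names are illustrative.

```python
# load a whole module, then reach its contents with module.name
import math
root_a = math.sqrt(16)

# load the same module under a short alias
import math as m
root_b = m.sqrt(16)

# load only one function from the module
from math import sqrt
root_c = sqrt(16)

# all three calls use the same underlying function
print(root_a, root_b, root_c)
```

Which form to use is mostly a matter of convenience: aliases such as `pd` and `np` save typing, while `from ... import` keeps long names like `DecisionTreeClassifier` short at the point of use.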

What does each of these modules do? You can think of a module as a library of books, each of which accomplishes a particular programming task. Modules can be quite complicated. In most cases, you will never learn all of the functionality of a module, and will have to use the documentation to find the parts relevant to your problem. It is useful to have a general sense of the types of tasks that each module handles, so that you can find the appropriate functionality.

- [pandas](http://pandas.pydata.org) is a library for handling datasets
- [numpy](https://numpy.org/) and [scipy](https://www.scipy.org/) are libraries for mathematical and scientific computing
- [matplotlib](https://matplotlib.org/) and [plotly](https://plotly.com/python/) are libraries for data visualization
- [sklearn](https://scikit-learn.org/stable/) and [imblearn](https://pypi.org/project/imblearn/) are libraries for machine learning

## Installing RDKit Module
- To view molecular structures, we will use the `RDKit` [module](https://www.rdkit.org/)
- The two code cells below install RDKit in Google Colab and then check that it can be imported


In [None]:
import sys
!time pip install rdkit-pypi

Collecting rdkit-pypi
  Downloading rdkit_pypi-2021.9.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.6 MB)
     |████████████████████████████████| 20.6 MB 54.9 MB/s 
Installing collected packages: rdkit-pypi
Successfully installed rdkit-pypi-2021.9.4

real	0m12.394s
user	0m7.092s
sys	0m0.960s


In [None]:
try:
  from rdkit import Chem
  from rdkit.Chem import Draw
  from rdkit.Chem.Draw import IPythonConsole
except ImportError:
  print('Stopping RUNTIME. Colaboratory will restart automatically. Please run again.')
  exit()

# Get Data
Now let's load the training and test datasets, which are stored on GitHub. To do this, we will use a method from one of the **modules** imported above.
A method provided by a module is called by writing `[module_name].[method]`.

For example, to use the `read_csv` method of the `pandas` module (imported above under the alias `pd`), write `pd.read_csv()`. With this method, CSV files are read into a **DataFrame** structure, which is similar to a table. For more on DataFrames:
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
- [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) documentation
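
To see what `pd.read_csv` returns without downloading anything, here is a minimal sketch that reads a tiny CSV from a string. The CSV contents and the variable names `csv_text` and `df` are made up for illustration and are not the project data.

```python
import io
import pandas as pd

# a tiny made-up CSV, standing in for a real file or URL
csv_text = """id,name,label
0,hexanal,1
1,imidazole,0
"""

# pd.read_csv accepts a path, a URL, or a file-like object such as StringIO;
# index_col=0 tells pandas to use the first column as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(type(df))   # the result is a pandas DataFrame
print(df.shape)   # (number of rows, number of columns)
print(df)
```

Because the first column was used as the index, the resulting DataFrame has only the `name` and `label` columns.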



In [None]:
# load the training data and save it in the variable "train"
# (the raw.githubusercontent.com address serves the CSV file itself; a github.com/.../blob/... address serves an HTML page)
train=pd.read_csv('https://raw.githubusercontent.com/Iris-Agape/WiDS_23/main/Data_practice/train.csv',index_col=0)
# load the test data and save it in the variable "test"
test=pd.read_csv('https://raw.githubusercontent.com/Iris-Agape/WiDS_23/main/Data_practice/test.csv',index_col=0)

Let's see what these data look like. You can display the current contents of a variable by entering its name and executing the cell:


In [None]:
# display the contents of the variable "train"
train

Unnamed: 0,SMILES,name,label,500,502,504,506,508,510,512,514,516,518,520,522,524,526,528,530,532,534,536,538,540,542,544,546,548,550,552,554,556,558,560,562,564,566,568,570,572,...,3922,3924,3926,3928,3930,3932,3934,3936,3938,3940,3942,3944,3946,3948,3950,3952,3954,3956,3958,3960,3962,3964,3966,3968,3970,3972,3974,3976,3978,3980,3982,3984,3986,3988,3990,3992,3994,3996,3998,4000
0,COC1OCCO1,2-methoxy-13-dioxolane,0,0.000051,0.000051,0.000052,0.000052,0.000053,0.000053,0.000054,0.000054,0.000055,0.000056,0.000057,0.000058,0.000058,0.000059,0.000060,0.000062,0.000063,0.000064,0.000065,0.000066,0.000068,0.000069,0.000070,0.000071,0.000072,0.000073,0.000074,0.000075,0.000076,0.000077,0.000077,0.000077,0.000078,0.000078,0.000077,0.000077,0.000077,...,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000005,0.000004,0.000004,0.000004,0.000004,0.000004,0.000004,0.000004,0.000004,0.000004
1,CCCCCC=O,hexanal,1,0.000121,0.000126,0.000131,0.000136,0.000141,0.000146,0.000151,0.000155,0.000159,0.000163,0.000166,0.000169,0.000171,0.000172,0.000172,0.000172,0.000171,0.000169,0.000166,0.000163,0.000159,0.000155,0.000150,0.000146,0.000141,0.000135,0.000130,0.000125,0.000120,0.000115,0.000111,0.000106,0.000102,0.000097,0.000093,0.000089,0.000086,...,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000009,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008
2,CC1CCC(C)C1C,1R2R3S-123-trimethylcyclopentane,0,0.000013,0.000013,0.000013,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000011,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000013,0.000013,...,0.000015,0.000015,0.000015,0.000015,0.000015,0.000015,0.000015,0.000015,0.000015,0.000015,0.000015,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000014,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013
3,c1cn[se]c1,12-selenazole,0,0.000243,0.000244,0.000245,0.000246,0.000248,0.000249,0.000251,0.000254,0.000256,0.000259,0.000262,0.000266,0.000270,0.000274,0.000278,0.000283,0.000288,0.000294,0.000300,0.000306,0.000313,0.000320,0.000328,0.000336,0.000344,0.000353,0.000363,0.000372,0.000383,0.000394,0.000405,0.000416,0.000429,0.000441,0.000454,0.000467,0.000480,...,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003,0.000003
4,CCC(C)=CC(C)C,3E-24-dimethylhex-3-ene,0,0.000053,0.000053,0.000053,0.000054,0.000054,0.000054,0.000054,0.000055,0.000055,0.000056,0.000056,0.000057,0.000058,0.000058,0.000059,0.000060,0.000062,0.000063,0.000064,0.000065,0.000067,0.000068,0.000069,0.000071,0.000072,0.000073,0.000074,0.000075,0.000075,0.000076,0.000076,0.000076,0.000075,0.000075,0.000074,0.000073,0.000071,...,0.000014,0.000014,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000013,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012,0.000012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2099,c1c[nH]cn1,imidazole,0,0.001059,0.001111,0.001166,0.001222,0.001282,0.001343,0.001406,0.001470,0.001536,0.001601,0.001667,0.001730,0.001792,0.001850,0.001904,0.001952,0.001993,0.002027,0.002052,0.002068,0.002074,0.002070,0.002056,0.002034,0.002002,0.001964,0.001918,0.001868,0.001813,0.001755,0.001695,0.001634,0.001573,0.001512,0.001452,0.001394,0.001338,...,0.000027,0.000027,0.000027,0.000026,0.000026,0.000025,0.000025,0.000025,0.000024,0.000024,0.000024,0.000023,0.000023,0.000023,0.000023,0.000022,0.000022,0.000022,0.000022,0.000021,0.000021,0.000021,0.000021,0.000020,0.000020,0.000020,0.000020,0.000019,0.000019,0.000019,0.000019,0.000018,0.000018,0.000018,0.000018,0.000018,0.000018,0.000017,0.000017,0.000017
2100,C=C[Si](C)(Cl)Cl,methyl-vinyl-dichlorosilane,0,0.001328,0.001328,0.001327,0.001326,0.001324,0.001323,0.001322,0.001322,0.001324,0.001326,0.001330,0.001336,0.001344,0.001355,0.001368,0.001384,0.001404,0.001426,0.001451,0.001480,0.001512,0.001547,0.001585,0.001625,0.001668,0.001713,0.001759,0.001806,0.001853,0.001899,0.001944,0.001986,0.002024,0.002057,0.002085,0.002106,0.002120,...,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002
2101,CCC=CCl,1E-1-chloro-1-butene,0,0.000055,0.000055,0.000055,0.000055,0.000055,0.000055,0.000055,0.000056,0.000056,0.000056,0.000056,0.000056,0.000056,0.000057,0.000057,0.000057,0.000058,0.000058,0.000058,0.000059,0.000059,0.000060,0.000060,0.000061,0.000061,0.000062,0.000062,0.000063,0.000063,0.000064,0.000065,0.000065,0.000066,0.000067,0.000067,0.000068,0.000069,...,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000008,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007,0.000007
2102,FC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)F,perfluoro-n-pentane,0,0.000111,0.000114,0.000117,0.000121,0.000124,0.000128,0.000131,0.000134,0.000138,0.000141,0.000143,0.000146,0.000149,0.000151,0.000153,0.000154,0.000156,0.000157,0.000158,0.000158,0.000159,0.000159,0.000160,0.000160,0.000160,0.000161,0.000161,0.000162,0.000163,0.000164,0.000165,0.000166,0.000168,0.000169,0.000171,0.000173,0.000175,...,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002,0.000002


* Each row contains data for a different molecule
* The numbers to the left of the first column (**0, 1, ...**) represent the index of each row
* The first column ("SMILES") contains the molecule SMILES string (more on that later)
* The second column ("name") contains the molecule name
* The third column ("label") contains a number indicating whether the molecule does (**1**) or does not (**0**) contain a carbonyl group
* The numbers at the top of the remaining columns (**500, 502, ..., 3998, 4000**) represent the vibrational frequency in wavenumbers, and the numbers below each frequency represent the vibrational intensity of each molecule at that frequency

We say that the vibrational intensity at each frequency is an **attribute** or **feature**. These terms refer to a property that can take on different values for different members of the dataset.

## Data Selection with Pandas
We will often need to access the values stored in particular positions in a variable. We can do this using the indices corresponding to that position:
- `iloc[row index, column index]` is used for position-based data selection
- `:` is used for selecting a range of index values
- Note that in Python, index values start from `0` instead of `1`

For example:
- `iloc[1:3,0]` : select row indices 1 to 2 (i.e., second and third rows) and the first column
- `iloc[:,0]` : select all rows and the first column
- `iloc[:,2:5]`: select all rows and column indices 2 to 4 (i.e., third through fifth columns)
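
The selections above can be tried on a small table before applying them to the training data. The following sketch builds a made-up DataFrame named `toy`; the molecules and intensity values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# a small made-up table: 4 rows, 5 columns
toy = pd.DataFrame({
    "SMILES": ["CCO", "CC=O", "CCC", "C=O"],
    "name": ["ethanol", "acetaldehyde", "propane", "formaldehyde"],
    "label": [0, 1, 0, 1],
    "500": [0.1, 0.2, 0.3, 0.4],
    "502": [0.5, 0.6, 0.7, 0.8],
})

single_value = toy.iloc[0, 0]     # first row, first column
two_rows = toy.iloc[1:3, 0]       # row indices 1 and 2 of the first column
one_column = toy.iloc[:, 0]       # every row of the first column
three_columns = toy.iloc[:, 2:5]  # every row of column indices 2, 3 and 4

print(single_value)
print(list(two_rows))
print(list(three_columns.columns))
```

Note that the end of a range is excluded: `1:3` covers indices 1 and 2, but not 3.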

In [None]:
# this line of code returns the value stored in the first row and first column of the training data
train.iloc[0,0]

In [None]:
# this line of code returns the first three rows and first 10 columns of the training data
train.iloc[0:3,0:10]

In [None]:
# guess what the output of this line of code will be
train.iloc[0:3,0:3]

# Plotting Spectra
Before continuing, let's look at the spectra of a few molecules to see what they look like.

- For visualization: [plotly line charts](https://plotly.com/python/line-charts/)
- You can add a trace by using
`fig.add_trace(go.Scatter(x=[independent variable], y=[dependent variable]))`
- You can choose which spectra to plot by changing the index values below

Note that the index values below refer to the row numbers in the training data DataFrame. For example, `idx_hasCarbonyl=1` selects the molecule in row 1 of the training data, which is hexanal. If you want to plot 12-selenazole (row 3) as the molecule without a carbonyl, change the corresponding line of code to read `idx_notCarbonyl=3`.

In [None]:
# change the index values below to pick molecules with and without a carbonyl
idx_hasCarbonyl=1
idx_notCarbonyl=0
# get the data for the two molecules
hasCarbonyl=train.set_index('name').iloc[idx_hasCarbonyl,3:]
notCarbonyl=train.set_index('name').iloc[idx_notCarbonyl,3:]
# plot the spectra
fig = go.Figure()
fig.add_trace(go.Scatter(x=hasCarbonyl.index, y=hasCarbonyl, name=hasCarbonyl.name,mode='markers'))
fig.add_trace(go.Scatter(x=notCarbonyl.index, y=notCarbonyl,name=notCarbonyl.name,mode='markers'))
fig.update_layout(title='Intensities over frequency',title_x=0.5)

Notice that the spectra span the same frequency range, but the maximum intensity value is different for each molecule.
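
The same observation can be checked numerically. The sketch below uses two made-up spectra stored as pandas Series (the intensity values and the names `spectrum_a` and `spectrum_b` are illustrative, not taken from the training data) and reports the largest intensity of each, along with the frequency at which it occurs.

```python
import pandas as pd

# made-up intensities for two molecules at the same three frequencies
frequencies = ["500", "502", "504"]
spectrum_a = pd.Series([0.0001, 0.0004, 0.0002], index=frequencies, name="molecule A")
spectrum_b = pd.Series([0.0010, 0.0020, 0.0015], index=frequencies, name="molecule B")

for spectrum in (spectrum_a, spectrum_b):
    # max() gives the largest intensity, idxmax() the frequency where it occurs
    print(spectrum.name, spectrum.max(), spectrum.idxmax())

# both spectra are defined on the same set of frequencies
same_range = list(spectrum_a.index) == list(spectrum_b.index)
print(same_range)
```

The same `max()` and `idxmax()` calls should work on the `hasCarbonyl` and `notCarbonyl` variables from the plotting cell, assuming that cell has been executed.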
