<a href="https://colab.research.google.com/github/DavidSenseman/BIO5853/blob/master/Lesson_01_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 5853: Biostatistics**

##### **Module 1: Python and Statistics**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 1 Material

* Part 1.1: Installing Python, Miniconda and Jupyter Lab
* Part 1.2: Introduction to Jupyterlab AI, Google CoLab
* **Part 1.3: Python Basics 1 -- Data Input**
* Part 1.4: Python Basics 2 -- Display of Statistical Data
* Part 1.5: Python Basics 3 -- Plotting in Python
* Part 1.6: Python Basics 4 -- Display of Statistical Datasets

### Lesson Setup

Run the next code cell to load the necessary packages.

In [None]:
# You MUST run this code cell first
import os
import shutil
import pandas as pd
import numpy as np
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

## Google CoLab Instructions

The following code checks whether the notebook is running in Google CoLab. If it is, the code maps your Google Drive to ```/content/drive``` and selects TensorFlow 2.x.

In [None]:
# You must run this cell second
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

## Part 1.3: Data Input

Biostatistics is fundamentally the analysis of biomedical datasets. Real biomedical datasets often contain experimental observations or clinical measurements from hundreds or even thousands of subjects or patients. In short, biomedical datasets are often too large to enter manually.

In this lesson we will focus on the software package, Pandas, and on file handling. These two topics naturally go together since the Pandas package includes a number of file handling methods that are frequently used in Python programming.

## Pandas

**_Pandas_** (pronounced as "PAN-daz") is a Python package designed for data manipulation and analysis. It provides data structures and operations for manipulating numerical tables and time series. 

Pandas is built on top of the NumPy package and provides a high-level interface for working with data, including data selection, cleaning, filtering, aggregation, and visualization. 

A central concept in Pandas is the **_DataFrame_**. A Pandas DataFrame is generally the most commonly used Pandas object. 

A _DataFrame_ is a two-dimensional labeled data structure with columns of potentially different data types (e.g. integers, floats, and strings). It is very similar to an Excel spreadsheet in which each **_row_** represents a single experimental subject or clinical patient and each **_column_** contains a different experimental or clinical measurement from that subject.
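
As a minimal sketch of this idea, the cell below builds a tiny DataFrame by hand. The DataFrame name `patients`, the column names, and the values are all made up for illustration and do not come from a course dataset.

```python
import pandas as pd

# Hypothetical measurements from three patients (illustration only)
patients = pd.DataFrame({
    "patient_id": [101, 102, 103],               # integers
    "temperature_c": [36.8, 38.1, 37.2],         # floats
    "diagnosis": ["healthy", "flu", "healthy"],  # strings
})

# Each row is one patient; each column is one measurement
print(patients)

# Number of rows and columns
print(patients.shape)

# Data type stored in each column
print(patients.dtypes)
```

Printing `dtypes` is a quick way to confirm that each column holds the kind of data you expect.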


### The Pandas Package

Like other Python packages, Pandas has to be _imported_ into a Python program with the following command before it can be used.

`import pandas as pd`

The normal _alias_ ('nickname') for Pandas is `pd`. When using a function that is part of the Pandas package, the alias `pd` is used instead of the package name. For example, to use the Pandas `read_csv()` function, the command would be:

`pd.read_csv(filename)`

Run the next code cell to import `pandas` which is needed for the examples and exercises below.

In [None]:
# RUN THIS CODE CELL

# Import the package Pandas
import pandas as pd

When you import a Python package _successfully_, there is usually no output. 

If you receive an error it probably means that the package has **not** been previously installed in your current `conda` environment. 

If you need to install Pandas, uncomment the `conda install` command in the next cell and then run the cell.

In [None]:
# Uncomment the next line and run this cell ONLY if you need to install pandas

#!conda install pandas -y

## File handling

**_File handling_** in Python is the process of manipulating files and data stored in a file system. This includes reading and writing files, creating and deleting files, accessing metadata about files, and more. 

Python has a variety of built-in tools to help with file handling, such as the `open()` function for opening files, the `close()` method of file objects for closing them, and the `os` module for interacting with the file system. Additionally, there are several third-party libraries that can be used to simplify file handling, such as the Pandas library for working with tabular data.
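
The following is a minimal sketch of these built-in tools, not part of the lesson's exercises. It writes a small text file into a temporary folder, reads it back, and asks the `os` module for some metadata. The file name `notes.txt` and its contents are hypothetical.

```python
import os
import tempfile

# Create a temporary folder so the example does not touch your own files
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # Open the file for writing; the "with" block closes it automatically
    with open(file_path, "w") as f:
        f.write("subject,weight\n")
        f.write("A,71.5\n")

    # Open the same file for reading and split the text into lines
    with open(file_path, "r") as f:
        lines = f.read().splitlines()

    # Use the os module to get metadata about the file
    file_exists = os.path.exists(file_path)
    file_size = os.path.getsize(file_path)

print(lines)
print(file_exists, file_size)
```

Using `with open(...)` is a common design choice because the file is closed for you even if an error occurs inside the block.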

There are many different types of files that you must be able to process. The most important file types are listed here:

* **CSV files:** (generally have the .csv extension) hold tabular data that resembles spreadsheet data.
* **Image files:** (generally with the .png or .jpg extension) hold images for computer vision.
* **Text files:** (often have the .txt extension) hold unstructured text and are essential for natural language processing.
* **JSON files:** (often have the .json extension) contain semi-structured textual data in a human-readable text-based format.
* **H5 files:** (can have a wide array of extensions) contain large arrays of numerical data in a binary, hierarchical format (HDF5) that is not human-readable. Keras and TensorFlow can store neural networks as H5 files.
* **Audio Files:** (often have an extension such as .au or .wav) contain recorded sound.

Data can come from a variety of sources. 

* **Your Hard Drive -** This type of data is stored locally, and Python accesses it from a path that looks something like: c:\data\myfile.csv. On occasion, you might have to download a data file from Canvas as part of a lesson.
* **The Internet -** This type of data resides in the cloud, and Python accesses it from a URL that looks something like: https://biologicslab.co/BIO5858/data/iris.csv
* **Google Drive (cloud) -** If your code runs in Google CoLab, you can use Google Drive to save and load data files. CoLab mounts your Google Drive into a path similar to the following: /content/drive/My Drive/myfile.csv.
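
As a rough sketch of how these locations differ in code, the cell below builds a file path for either a local folder or a mounted Google Drive. The helper `build_data_path()` and the local `data` folder are assumptions made for illustration; they are not part of the course materials.

```python
import os

def build_data_path(filename, colab):
    """Return a path to a data file (hypothetical helper for illustration)."""
    if colab:
        # Google CoLab mounts Google Drive under this folder
        return os.path.join("/content/drive/My Drive", filename)
    # Otherwise look in a local "data" folder (assumed location)
    return os.path.join(os.getcwd(), "data", filename)

# Path when running in Google CoLab
print(build_data_path("myfile.csv", colab=True))

# Path when running on your own computer
print(build_data_path("myfile.csv", colab=False))
```

In a notebook that has already run the CoLab setup cell, the `COLAB` flag defined there could be passed as the second argument.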



## File format of data files

Data files can be either **_formatted_** or **_unformatted_**. For example, a Microsoft Word file (.doc or .docx) is a _formatted file_. Microsoft uses a proprietary document format to store MS Word files. If you try to "read" a formatted file with a simple text editor like Notepad, you will see something unintelligible like this:

![__](https://biologicslab.co/BIO5853/images/MSWord.png)

Most data files used for biostatistics and machine learning are _unformatted_ text files. They can be read with any word processor program or even a simple text editor. 

For example, the file `iris.txt` looks like this if you read it with a simple text editor:

~~~text
sepal_l	sepal_w	petal_l	petal_w	species
5.1	3.5	1.4	0.2	Iris-setosa
4.9	3.0	1.4	0.2	Iris-setosa
4.7	3.2	1.3	0.2	Iris-setosa
4.6	3.1	1.5	0.2	Iris-setosa
5.0	3.6	1.4	0.2	Iris-setosa
5.4	3.9	1.7	0.4	Iris-setosa
4.6	3.4	1.4	0.3	Iris-setosa
5.0	3.4	1.5	0.2	Iris-setosa
4.4	2.9	1.4	0.2	Iris-setosa
~~~
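
An unformatted text file like the `iris.txt` sample can be read directly into a DataFrame. The sketch below assumes the columns are separated by tabs, as they appear to be in the sample, and uses the first three data rows typed into a string instead of a real file. The variable names `iris_text` and `iris_df` are chosen for illustration.

```python
import io
import pandas as pd

# First rows of the iris sample, with tabs (\t) between the columns
iris_text = (
    "sepal_l\tsepal_w\tpetal_l\tpetal_w\tspecies\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
    "4.7\t3.2\t1.3\t0.2\tIris-setosa\n"
)

# io.StringIO lets read_csv() treat the string as if it were a file;
# sep="\t" tells Pandas that the columns are separated by tabs
iris_df = pd.read_csv(io.StringIO(iris_text), sep="\t")

print(iris_df)
print(iris_df.dtypes)
```

If `sep` were left at its default (a comma), each line would be read as a single column, so checking the result after reading is worthwhile.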


### Example 1: Read Data File from the Internet 

The code in the cell below uses the function `pd.read_csv(filename, sep)` to read the data file `apple_quality.csv`. 

The code first creates a string variable called `URL` with the internet address of the file server. The code then creates a second string variable called `FNAME` that has the actual file name. Notice that you **must** enclose these names in quotation marks.

These two strings are then "added" together to give the complete internet address. 

As the file is read, the data is stored in a Pandas DataFrame called `df1`. 

When reading text files with `pd.read_csv()`, it is _always_ a good idea to print out part of the newly created DataFrame to make sure the file was read correctly. One way to do this is to use the `display()` function, which is available automatically in Jupyter notebooks, as shown below. Data files often contain a large number of columns and almost always a huge number of rows, too many to print to your computer screen. The code in the cell below uses the Pandas function `pd.set_option()` to display a maximum of 6 rows and 6 columns.


In [None]:
# Example 1: Use pd.read_csv to read a file from the Internet

# File location
URL="https://biologicslab.co/BIO5853/data/"

# File Name
FNAME="apple_quality.csv"

# Read the remote data file using the Pandas read_csv() function
df1 = pd.read_csv(URL + FNAME)  # the default separator is a comma

# Set max columns and max rows
pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 6)
pd.set_option('display.precision', 3)

# Display at most 6 columns and 6 rows
display(df1)

You should see the following output:

![](https://biologicslab.co/BIO5853/images/Lesson_01_3_image_01.png)

### **Exercise 1: Read Data File from the Internet**

In the cell below, use the Pandas `pd.read_csv(filename, sep)` function to read the data file `heart_disease.csv` from the course web server and store the data in a new DataFrame called `df2`. Use the `display()` function and set the max rows and max columns to 6. 

In [None]:
# Insert your code for Exercise 1 here 



You should see the following output:

![](https://biologicslab.co/BIO5853/images/Lesson_01_3_image02.png)

### Example 2: Summary Statistics with Pandas `describe()` 

With any new dataset, it is generally useful to get a quick statistical overview of the dataset's contents using the Pandas `describe()` method.  

The `describe()` method returns a variety of summary statistics for each numeric column, including the count of non-null values, mean, standard deviation, minimum, maximum, and the first, second (median), and third quartiles. Because the count excludes missing values, comparing it with the total number of rows shows how much data is missing. This information can be used to get a better understanding of the data and its distributions.

The code in the cell below shows how to use this method with the data stored in `df1`.

In [None]:
# Example 2: Summary Statistics with Pandas `describe()` 

# Set max columns and max rows
pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 6)
pd.set_option('display.precision', 3)

# Describe() method with df1
df1.describe()

You should see the following output:

![](https://biologicslab.co/BIO5853/images/Lesson_01_3_image03.png)
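
To see how `describe()` treats missing data, here is a small sketch using a hand-made DataFrame. The DataFrame name `labs`, its columns, and its values are hypothetical. Note that `describe()` itself reports only the count of non-missing values; the percentage of missing values is computed separately here with `isna().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical lab values; np.nan marks a missing measurement
labs = pd.DataFrame({
    "glucose": [90.0, 110.0, np.nan, 100.0],
    "heart_rate": [60.0, 70.0, 80.0, 90.0],
})

# Summary statistics; "count" only counts the non-missing values
summary = labs.describe()
print(summary)

# Percent of values missing in each column
percent_missing = labs.isna().mean() * 100
print(percent_missing)
```

A column whose count is lower than the number of rows in the DataFrame contains missing values that may need attention before analysis.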

### **Exercise 2: Use the Pandas `describe()` method.**

In the cell below, use the Pandas `describe()` method to print the summary statistics of the data in your DataFrame `df2`.  Set the max rows and columns to 8.

In [None]:
# Insert your code for Exercise 2 here 



You should see the following output:

![](https://biologicslab.co/BIO5853/images/Lesson_01_3_image04.png)

## Dropping a field in a _pandas_ DataFrame

**_Dropping a field_** in a _pandas_ DataFrame is a way of removing a column from the dataset. This can be done using the `drop()` method, which takes the label of the column that you want to remove. You may want to do this if the field is irrelevant to the analysis you are performing, or if it contains redundant information. For example, you will need to drop fields that are of no value such as an `ID` field.

### Example 3: Use `drop()` to Drop a DataFrame Column. 

The code in the cell below uses the _pandas_ `drop()` method to drop the column (field) `A_id` from the Apple Quality data set. 

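To preserve the original data, the code begins by making a **_copy_** of the original DataFrame. By default, `copy()` makes a deep copy, so changes made to the copy do not affect the original. The copy is made with this code chunk: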

~~~text
# Make a new copy of df1
df1_copy = df1.copy()
~~~



In [None]:
# Example 3: Drop a field in a data frame using drop() method 

# Make copy of df1
df1_copy = df1.copy()

# Print column names before drop
print(f"Before drop: {list(df1_copy.columns)}")

# Drop the column A_id
df1_copy.drop(columns=['A_id'], inplace=True)

# Print column names after the drop
print(f"After drop: {list(df1_copy.columns)}")

If the code is correct you should see:
~~~text
Before drop: ['A_id', 'Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', 'Acidity', 'Quality']
After drop: ['Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Ripeness', 'Acidity', 'Quality']
~~~
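
Example 3 uses `inplace=True`, which changes the DataFrame directly. As an alternative sketch, `drop()` can also be called without `inplace=True`, in which case it returns a new DataFrame and leaves the original untouched. The small DataFrame `apples` below is made up for illustration and is not the Apple Quality dataset.

```python
import pandas as pd

# Hypothetical table with an ID column (illustration only)
apples = pd.DataFrame({
    "A_id": [0, 1, 2],
    "Size": [1.5, 2.0, 2.5],
    "Quality": ["good", "good", "bad"],
})

# Without inplace=True, drop() returns a new DataFrame
apples_no_id = apples.drop(columns=["A_id"])

# The original DataFrame still has the A_id column
print(list(apples.columns))

# The new DataFrame does not
print(list(apples_no_id.columns))
```

Either style works; assigning the result to a new name keeps the original data available without a separate call to `copy()`.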

### **Exercise 3: Use the Pandas `drop()` method to drop a field in a DataFrame.** 

In the cell below, make a copy of your DataFrame, `df2`. Call your copy `df2_copy`. 

Use the Pandas `drop()` method to drop the `Age` column from `df2_copy`. Print the column names before and after dropping the column.

In [None]:
# Insert your code for Exercise 3 here 



If your code is correct you should see the following output:
~~~text
Before drop: ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']
After drop: ['Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']
~~~

## **Lesson Turn-in**

When you have run all of the code cells in sequential order (the last code cell above should be 10) you need to create a PDF of your notebook. 

To do this, click on **File --> Print** as shown in this image.

![__](https://biologicslab.co/BIO5853/images/Lesson_01_3_image05.png)

This will bring up the Print dialog box. Click on the downward caret **v** and select **Save to PDF**.

![__](https://biologicslab.co/BIO5853/images/Lesson_01_3_image06.png)

Then click on **Save**. 

![__](https://biologicslab.co/BIO5853/images/Lesson_01_3_image07.png)

This will bring up a dialog box asking you where to save your PDF and what to name it. You should save your PDF in the Lesson_01_3 folder with the name `Lesson_01_03_lastname.pdf` where _lastname_ is your last name. 

Finally, upload the **_PDF_** to Lesson_01_3 Assignment on Canvas for grading, not the actual Jupyter Lab notebook.
