<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_02_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 2: Machine Learning**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 2 Material

* **Part 2.1: Pandas DataFrame Operations**
* Part 2.2: Categorical Values
* Part 2.3: Grouping, Sorting and Shuffling on Pandas  

### Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
# YOU MUST RUN THIS CELL FIRST

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

Mounted at /content/drive
Note: using Google CoLab
Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
david.senseman@gmail.com


## Datasets for Class_02_1

An important objective of this course is to introduce you to a number of different **_datasets_** that are similar to the kinds of data that you might encounter as you pursue your career as a biologist or as a clinical investigator.

In this class we will be introduced to two new datasets: the **_Apple Quality_** dataset, used for the **Examples**, and the **_Obesity Prediction_** dataset, used for the **Exercises**.

### **Apple Quality Dataset**

[Apple Quality Data Set](https://www.kaggle.com/datasets/nelgiriyewithana/apple-quality)

![__](https://biologicslab.co/BIO1173/images/apples.jpg)

A student trained in biology would have a good foundation for understanding the biological processes involved in the growth and development of agricultural produce. If they were specifically interested in the quality of agricultural produce, from a biological perspective, they might need to assess factors such as nutrient content, pesticide residues, presence of pathogens or contaminants, and overall health and viability of the produce. This could involve conducting laboratory tests, analyzing data, and interpreting results to determine the quality and safety of the agricultural products.

**Description:**

The Apple Quality dataset contains information about various attributes of a large sample of apples (_n_=4000), providing insights into their characteristics. The dataset includes details such as fruit ID, size, weight, sweetness, crunchiness, juiciness, ripeness, acidity, and quality.

**Key Features:**

* **A_id:** Unique identifier for each fruit
* **Size:** Size of the fruit
* **Weight:** Weight of the fruit
* **Sweetness:** Degree of sweetness of the fruit
* **Crunchiness:** Texture indicating the crunchiness of the fruit
* **Juiciness:** Level of juiciness of the fruit
* **Ripeness:** Stage of ripeness of the fruit
* **Acidity:** Acidity level of the fruit
* **Quality:** Overall quality of the fruit

### Example 1: Read Datafile and Create DataFrame

The cell below uses the Pandas `read_csv()` function to read the CSV file `apple_quality.csv`, located on the course HTTPS server, using the following code chunk:
~~~text
aqDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/apple_quality.csv",
    na_values=['NA','?'])
~~~
The function `read_csv()` is frequently used for reading CSV files. In this example, the `read_csv()` function takes 2 arguments. The first argument,
~~~text
"https://biologicslab.co/BIO1173/data/apple_quality.csv",
~~~
is a string that provides the filepath to, and filename of, the datafile.

The second argument:
~~~text
na_values=['NA','?']
~~~
tells Pandas to treat the strings `NA` and `?` in the file as missing data and to convert them into the value `NaN`, which stands for Not-a-Number. As the file is read, Pandas creates a DataFrame called `aqDF` to hold the information.

After reading the datafile into a DataFrame, it is always a good idea to use the `display()` function to print out a specified number of rows and columns of the new DataFrame to make sure the data was read correctly.

In the cell below, the `pd.set_option()` function is used to specify the maximum number of rows and columns to display. Many datasets in this course have a large number of rows and columns, which cannot be printed out easily in your notebook.
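
To see what `na_values` does without downloading anything, the sketch below reads a tiny CSV from an in-memory string. The three-row CSV text and the name `demoDF` are made up for illustration and are not part of the course data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the "?" and "NA" entries stand in for missing data
csv_text = "A_id,Acidity\n0,0.5\n1,?\n2,NA\n"

# Read the text as if it were a file, treating 'NA' and '?' as missing values
demoDF = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

# Count the missing values (NaN) in each column
print(demoDF.isna().sum())
```

Both the `?` and the `NA` entry in the `Acidity` column should be counted as missing, while `A_id` has none.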

In [None]:
# Example 1: Read the datafile and create a Pandas DataFrame

import pandas as pd

# Read datafile and create DataFrame
aqDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/apple_quality.csv",
    na_values=['NA','?'])

# Set the display for 12 rows and 6 columns
pd.set_option('display.max_rows', 12)
pd.set_option('display.max_columns', 6)

# Display the DataFrame
display(aqDF)

Unnamed: 0,A_id,Size,Weight,...,Ripeness,Acidity,Quality
0,0,-3.970049,-2.512336,...,0.329840,-0.491590,good
1,1,-1.195217,-2.839257,...,0.867530,-0.722809,good
2,2,-0.292024,-1.351282,...,-0.038033,2.621636,bad
3,3,-0.657196,-2.271627,...,-3.413761,0.790723,good
4,4,1.364217,-1.296612,...,-1.303849,0.501984,good
...,...,...,...,...,...,...,...
3995,3995,0.059386,-1.067408,...,2.244055,0.137784,bad
3996,3996,-0.293118,1.949253,...,-1.087900,1.854235,good
3997,3997,-2.634515,-2.138247,...,4.763859,-1.334611,bad
3998,3998,-4.008004,-1.779337,...,0.214488,-2.229720,good


If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_02_1_Exm1.png)

From the output, we can see that `aqDF` has 4000 rows and 9 columns. Only 6 of the 9 columns are printed out. The missing columns are represented by the `...`.

In a Pandas DataFrame, each row is an _observation_, in this case a single piece of fruit (apple). The 9 columns record the various **_features_**, or factors, that were measured for each apple.

### **Obesity Prediction Dataset**

[Obesity Prediction Dataset](https://www.kaggle.com/datasets/mrsimple07/obesity-prediction)

![___](https://biologicslab.co/BIO1173/images/obesity.jpg)

**_Obesity_** is a medical condition characterized by excessive accumulation of body fat to the extent that it can have negative effects on health. It is typically defined by a body mass index (BMI) of 30 or above. Obesity increases the risk of various health problems, including heart disease, type 2 diabetes, certain types of cancer, and other chronic conditions.

Studying obesity is important for several reasons:
* **Public health impact:** Obesity is a significant public health concern worldwide, with rates of obesity on the rise in many countries. Understanding the causes, consequences, and potential interventions for obesity is crucial for developing effective public health strategies to prevent and manage this condition.
* **Health consequences:** Obesity is associated with a wide range of health problems, including cardiovascular disease, diabetes, hypertension, and certain types of cancer. Studying obesity can help researchers and healthcare professionals better understand the mechanisms by which obesity contributes to these conditions and develop targeted interventions to improve health outcomes.
* **Socioeconomic impact:** Obesity can have negative socioeconomic consequences, such as increased healthcare costs, reduced productivity, and decreased quality of life. By studying obesity, researchers can identify strategies to prevent and treat obesity that can help alleviate these economic burdens.
* **Psychological and social impact:** Obesity can also have psychological and social consequences.
  
**Features of The Obesity Prediction Dataset**

The Obesity Prediction dataset provides comprehensive information on individuals' demographic characteristics, physical attributes, and lifestyle habits, aiming to facilitate the analysis and prediction of obesity prevalence. It includes key variables such as age, gender, height, weight, body mass index (BMI), physical activity level, and obesity category.

* **Age:** The age of the individual, expressed in years.
* **Gender:** The gender of the individual, categorized as male or female.
* **Height:** The height of the individual, typically measured in centimeters or inches.
* **Weight:** The weight of the individual, typically measured in kilograms or pounds.
* **BMI:** A calculated metric derived from the individual's weight and height
* **PhysicalActivityLevel:** This variable quantifies the individual's level of physical activity
* **ObesityCategory:** Categorization of individuals based on their BMI into different obesity categories

### **Exercise 1: Read Datafile and Create DataFrame**

In the cell below, use the Pandas `read_csv()` function to read the obesity dataset and create a DataFrame called `opDF` to hold the information. You can use this code chunk:
~~~text
# Read the datafile
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])
~~~

Use the `pd.set_option()` function to limit the display to 8 rows and 7 columns before you display your new DataFrame `opDF`.


In [None]:
# Insert your code for Exercise 1 here



Unnamed: 0,Age,Gender,Height,Weight,BMI,PhysicalActivityLevel,ObesityCategory
0,56,Male,173.575262,71.982051,23.891783,4,Normal weight
1,69,Male,164.127306,89.959256,33.395209,2,Obese
2,46,Female,168.072202,72.930629,25.817737,4,Overweight
3,32,Male,168.459633,84.886912,29.912247,3,Overweight
...,...,...,...,...,...,...,...
996,35,Female,165.076490,97.639771,35.830783,1,Obese
997,49,Female,156.570956,78.804284,32.146036,1,Obese
998,64,Male,164.192222,57.978115,21.505965,4,Normal weight
999,66,Female,178.537130,74.962164,23.517168,1,Normal weight


If your code is correct you should see the following table.

![___](https://biologicslab.co/BIO1173/images/class_02_1_ObesityDF.png)

As you can see, the `opDF` DataFrame contains information from 1000 patients (1 row/patient) with 7 measurements (features) for each patient.

# Part 2.1: Pandas DataFrame Operations

In this lesson we continue our investigation of the software package, Pandas, looking at some of the DataFrame operations that you will be using to prepare data so it can be used to train a neural network.  

## Dealing with Outliers

**_Outliers_** are values that are unusually high or low. We typically consider an outlier to be a value that is several standard deviations from the mean. Sometimes outliers are simply errors resulting from [observation error](https://en.wikipedia.org/wiki/Observational_error).

Outliers that are very large or very small can be difficult to address. The following **_function_** can remove such values. The function defined below will drop any _row_ of data in which the value in a selected column is a specified number of standard deviations or more above, or below, the mean. In other words, if a certain patient has one unusually large or small clinical observation, _all_ the data from that patient will be removed.

In [None]:
# FUNCTION TO REMOVE OUTLIERS

import numpy as np

# Remove all rows where the value in the specified column is
# sd or more standard deviations above or below the column mean.
# The DataFrame is modified in place; nothing is returned.
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean())
                          >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)

--------------------------------------

### **FUNCTIONS**

A Python **_function_** is a reusable block of code that performs a specific task. It is defined using the `def` keyword followed by the _function name_ and parentheses. Functions can take input values called arguments or parameters, which are defined inside the parentheses. These parameters can be used within the function's code block to perform computations or operations.

Functions in Python are organized, modular, and help in code reusability. They promote code readability and maintainability by dividing complex logic into smaller, manageable units. They also allow programmers to encapsulate functionality and use it repeatedly throughout the program, improving efficiency and reducing duplication.

Functions can have a return statement that provides the computed result or value back to the caller. This returned value can be stored in a variable or used directly. Functions can also be called within other functions, allowing for nested function calls. Overall, Python functions are a fundamental building block in programming that enables code organization, reusability, and abstraction.
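
As a small illustration of parameters and a `return` statement, the sketch below defines a function that computes body mass index. The function name `compute_bmi` and its parameter names are hypothetical, and the sketch assumes weight in kilograms and height in centimeters. The values passed in are the weight and height from the first row of the Obesity Prediction table shown earlier.

```python
def compute_bmi(weight_kg, height_cm):
    """Return body mass index from weight (kg) and height (cm)."""
    height_m = height_cm / 100          # convert centimeters to meters
    return weight_kg / height_m ** 2    # BMI = kilograms / meters squared

# Call the function and store the returned value in a variable
bmi = compute_bmi(71.982051, 173.575262)
print(round(bmi, 2))
```

The result should agree closely with the `BMI` value listed for that patient, which suggests the dataset's `BMI` column was computed the same way.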

-----------------------------------------

### Example 2: Remove Rows with Outliers

The cell below uses the `remove_outliers()` function, created above, to remove outliers in the Apple Quality dataset. In order to use this function, you need to decide which feature (i.e. column) you want to test for outliers. In this example, we will focus on the `Acidity` measurement.

Before we can apply the function to the DataFrame, we must take care of any records in which the acidity value is missing (i.e. `NaN`).

To do this, we first compute a single number, the statistical [median](https://en.wikipedia.org/wiki/Median) of the acidity level of all of the apples in the sample. This is done using `median()` as follows:
~~~text
med = aqDF['Acidity'].median()
~~~

Notice that we are using square bracket indexing to select **only** the column labeled `'Acidity'`. We then use the median acidity value to "fill in" any blank spaces with the following code line:
~~~text
aqDF['Acidity'] = aqDF['Acidity'].fillna(med)
~~~

Why choose the median value to fill in the blanks? Replacing each missing value with the column's median leaves the column's median unchanged.
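
A minimal sketch of this idea on a made-up Series (the name `s` and its values are hypothetical): the median is computed while ignoring the `NaN`, and filling the gap with that median leaves the median where it was.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with one missing value
s = pd.Series([1.0, 2.0, np.nan, 4.0, 100.0])

med = s.median()          # NaN is skipped when the median is computed
filled = s.fillna(med)    # replace the NaN with the median

print(f"Median before filling: {med}")
print(f"Median after filling:  {filled.median()}")
print(f"Missing values left:   {filled.isna().sum()}")
```

Filling with the mean instead would pull in the influence of the extreme value `100.0`, which is one reason the median is often preferred for this step.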

Having filled in any missing acidity values, we can now use our function to remove outliers:
~~~text
remove_outliers(aqDF,'Acidity',2)
~~~
The number `2` specifies the number of [standard deviations](https://en.wikipedia.org/wiki/Standard_deviation) used for the upper and lower boundaries. The function will remove any apple (i.e. row) from the DataFrame whose acidity value is `2` or more standard deviations (sd) above, or below, the mean.

Finally, to see if the function worked, we print out the number of rows in the DataFrame before, and then after, applying the function.
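
The boundary test can be sketched on a small, hypothetical DataFrame (`toyDF` and its values are made up for illustration). Rather than dropping rows in place, this sketch only builds the boolean condition, so you can see which rows would be removed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: nine ordinary values and one extreme value
toyDF = pd.DataFrame({'value': [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]})

# Absolute distance of each value from the column mean
distance = np.abs(toyDF['value'] - toyDF['value'].mean())

# True for rows that are 2 or more standard deviations from the mean
is_outlier = distance >= 2 * toyDF['value'].std()

# Show the rows that would be dropped
print(toyDF[is_outlier])
```

Only the row holding the value `50` meets the condition, so it is the only row that would be dropped.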


In [None]:
# Example 2: Use remove_outliers() function with Apple Quality dataset

import numpy as np

# Fill in missing acidity values with the median
med = aqDF['Acidity'].median()   # calculate the median value for acidity
aqDF['Acidity'] = aqDF['Acidity'].fillna(med)  # replace any NaN with median

# Print number of rows before
print("Length before Acidity outliers dropped: {}".format(len(aqDF)))

# Apply function
remove_outliers(aqDF,'Acidity',2)

# Print the number of rows after
print("Length after Acidity outliers dropped: {}".format(len(aqDF)))

# Set display
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 8)

# Display the new DataFrame
display(aqDF)

Length before Acidity outliers dropped: 3816
Length after Acidity outliers dropped: 3709


Unnamed: 0,A_id,Size,Weight,Sweetness,...,Juiciness,Ripeness,Acidity,Quality
0,0,-3.970049,-2.512336,5.346330,...,1.844900,0.329840,-0.491590,good
1,1,-1.195217,-2.839257,3.664059,...,0.853286,0.867530,-0.722809,good
...,...,...,...,...,...,...,...,...,...
3998,3998,-4.008004,-1.779337,2.366397,...,2.161435,0.214488,-2.229720,good
3999,3999,0.278540,-1.715505,0.121217,...,1.266677,-0.776571,1.599796,good


If your code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_02_1_Drop1.png)

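Notice that after applying the `remove_outliers()` function to the complete dataset, there are `184` fewer rows (apples) in the `aqDF` DataFrame (i.e. 4000 - 3816). Because the function modifies `aqDF` in place, running the cell a second time starts from the already reduced DataFrame and removes additional rows, so the counts will differ from the first run.

Were these acidity measurements so different because they were faulty?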

Probably not.

If you have had statistics, this is almost exactly what you would expect if acidity values were **_normally distributed_**.

According to the [**_68%-95%-99.7% Rule_**](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule), 95% of a normally distributed variable (in this case `Acidity`) should be within 2 standard deviations of the mean. If you multiply 4000 apples X 0.95, you get 3,800. In other words, you should expect about 200 of the 4,000 apples to have an acidity more than 2 standard deviations above, or below, the mean. This is almost exactly what we got.
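
The rule can be checked with a short simulation. This is a sketch on randomly generated values, not on the apple data. The seed, sample size, and variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)      # fixed seed so the run is repeatable
values = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Fraction of values within 2 standard deviations of the sample mean
within = np.abs(values - values.mean()) < 2 * values.std()
print(f"Fraction within 2 sd: {within.mean():.3f}")

# Expected number of apples outside the boundary, out of 4000
print(f"Expected outliers in 4000 apples: {4000 * (1 - 0.95):.0f}")
```

The simulated fraction should land close to 0.95, in line with the rule.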

In "real life" you almost never remove any data from a sample _unless_ you have very strong reasons to suspect that the data was somehow faulty. And even then, you are obliged to report what was done, and why, to anyone reading your analysis.

### **Exercise 2:  Remove Rows with outliers**

In the cell below, write the Python code to remove outliers from the `opDF` DataFrame using the `remove_outliers()` function created above. Focus on the characteristic called `BMI` that stands for Body Mass Index. Only discard patients (rows) that have a `BMI` that is more than `3` standard deviations (sd) from the mean. Make sure to fill in any missing `BMI` values with the median value for the `BMI` column.

When you are done, print out `7` columns and `5` rows of the `opDF` DataFrame.

In [None]:
# Insert your code for Exercise 2 here



Length before BMI outliers dropped: 1000
Length after BMI outliers dropped: 996


Unnamed: 0,Age,Gender,Height,Weight,BMI,PhysicalActivityLevel,ObesityCategory
0,56,Male,173.575262,71.982051,23.891783,4,Normal weight
1,69,Male,164.127306,89.959256,33.395209,2,Obese
...,...,...,...,...,...,...,...
998,64,Male,164.192222,57.978115,21.505965,4,Normal weight
999,66,Female,178.537130,74.962164,23.517168,1,Normal weight


If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_02_1_Drop2.png)

After applying the `remove_outliers()` function with `sd=3`, you should have removed only 4 patients, leaving 996 patients (rows) in the `opDF` dataframe.

Again, using the [**_68%-95%-99.7% Rule_**](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule), 99.7% of a normally distributed variable (in this case `BMI`) should be within 3 standard deviations of the mean. If you multiply 1000 patients X 0.997, you get 997, so about 3 patients would be expected to fall outside this range. Again, the number of rows that were removed (4) is almost exactly what would be predicted.

--------------------------------------

## **Concatenating Rows and Columns**
Pandas can concatenate rows and columns together to form new DataFrames. In Pandas, **_concatenation_** refers to the process of combining two or more DataFrames or series along a particular axis to create a new data structure. It allows us to stack or join DataFrames/series vertically or horizontally.

Pandas provides the `concat()` function to perform concatenation. By default, concatenation is done vertically along `axis 0`, resulting in a new DataFrame/series with rows appended. However, you can specify `axis=1` to concatenate horizontally, placing the columns side-by-side.

In the example below concatenation will be done along `axis 1` so the columns end up side-by-side (horizontally).
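
The difference between the two axes can be sketched with two tiny, hypothetical DataFrames (`left` and `right` are made-up names and values):

```python
import pandas as pd

# Two hypothetical one-column DataFrames with the same row index
left = pd.DataFrame({'A': [1, 2]})
right = pd.DataFrame({'B': [3, 4]})

# axis=0 stacks the rows; columns that do not match are filled with NaN
stacked = pd.concat([left, right], axis=0)

# axis=1 places the columns side-by-side, matching rows by index
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)        # rows from both inputs, columns A and B
print(side_by_side.shape)   # same rows, columns A and B next to each other
```

Stacking along `axis=0` gives 4 rows and 2 columns (with `NaN` where a column did not exist), while `axis=1` gives 2 rows and 2 columns.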

-------------------------------------

### Example 3: Concatenate Pandas columns

The cell below shows the Python code for extracting the columns `Sweetness` and `Quality` from the `aqDF` DataFrame. The code then uses the Pandas `pd.concat()` function to combine these two columns, side-by-side, into a new DataFrame called `catDF`.

In [None]:
# Example 3: Concatenate Pandas columns

import pandas as pd

# Create a new DataFrame from Sweetness and Quality

# Extract specific columns from aqDF
col_sweet = aqDF['Sweetness']
col_qual = aqDF['Quality']

# Use Pandas concat() function to add them side-by-side
catDF = pd.concat([col_sweet, col_qual], axis=1)

# Set the display for 8 rows and all columns
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 0)

# Display new DataFrame
display(catDF)

Unnamed: 0,Sweetness,Quality
0,5.346330,good
1,3.664059,good
2,-1.738429,bad
3,1.324874,good
...,...,...
3996,-0.204020,good
3997,-2.440461,bad
3998,2.366397,good
3999,0.121217,good


If your code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_02_1_Con1.png)


### **Exercise 3: Concatenate Pandas columns**

In the cell below write the Python code to extract the columns `ObesityCategory` and `BMI` from the `opDF` DataFrame. Use the Pandas `concat()` function to combine these two columns side-by-side (horizontally, with `axis=1`) to create a new DataFrame called `resDF` (for _result_ dataframe). Use the `display()` function to print out all of the columns and 8 rows.

In [None]:
# Insert your code for Exercise 3 here



Unnamed: 0,ObesityCategory,BMI
0,Normal weight,23.891783
1,Obese,33.395209
2,Overweight,25.817737
3,Overweight,29.912247
...,...,...
996,Obese,35.830783
997,Obese,32.146036
998,Normal weight,21.505965
999,Normal weight,23.517168


If your code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_02_1_Con2.png)


### Example 4: Concatenate Pandas Rows

The `concat()` function can also concatenate rows together. The code in the cell below concatenates the first three rows and the last three rows of the Apple Quality dataset.

The code uses square bracket indexing to specify which rows to include in a new DataFrame called `catDF`. The code `aqDF[0:3]` specifies all of the rows, from the first row, index `0`, up to but **not** including index `3` which is the 4th row.

The code `aqDF[-3: ]` specifies _all_ of the rows starting at the 3rd row from the end, up to and including the last row. (Remember, leaving out an index means "use everything".)

The code `axis=0` specifies the operation should be done on the _rows_ not the columns. If you want your operation to work on the columns instead, you would specify `axis=1`.
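
A minimal sketch of the same slicing on a small, hypothetical DataFrame (`toyDF` and `endsDF` are made-up names) makes it easy to see which rows each slice selects:

```python
import pandas as pd

# Hypothetical DataFrame with ten rows, indexed 0 through 9
toyDF = pd.DataFrame({'value': range(10)})

first_three = toyDF[0:3]    # rows at positions 0, 1, 2
last_three = toyDF[-3:]     # the last three rows: positions 7, 8, 9

# Stack the two slices on top of each other
endsDF = pd.concat([first_three, last_three], axis=0)
print(endsDF.index.tolist())
```

Note that the rows keep their original index labels after concatenation, which is why the apple example shows indices such as `3997` in a six-row DataFrame.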


In [None]:
# Example 4: Concatenate Pandas Rows

import pandas as pd

# Use Pandas concat() function
catDF = pd.concat([aqDF[0:3],aqDF[-3: ]], axis=0)

# Set the display for all rows and 6 columns
pd.set_option('display.max_rows', 0)
pd.set_option('display.max_columns', 6)

# Display new dataframe
display(catDF)

Unnamed: 0,A_id,Size,Weight,...,Ripeness,Acidity,Quality
0,0,-3.970049,-2.512336,...,0.32984,-0.49159,good
1,1,-1.195217,-2.839257,...,0.86753,-0.722809,good
2,2,-0.292024,-1.351282,...,-0.038033,2.621636,bad
3997,3997,-2.634515,-2.138247,...,4.763859,-1.334611,bad
3998,3998,-4.008004,-1.779337,...,0.214488,-2.22972,good
3999,3999,0.27854,-1.715505,...,-0.776571,1.599796,good


If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_02_1_Exm4.png)


### **Exercise 4: Concatenate Pandas rows**

In the cell below, use the Pandas `concat()` function to concatenate the first two and last two rows of the Obesity Prediction dataset to create a new DataFrame called `resDF`. Display _all_ of the columns, and _all_ of the rows, in `resDF`.

In [None]:
# Insert your code for Exercise 4 here



Unnamed: 0,Age,Gender,Height,Weight,BMI,PhysicalActivityLevel,ObesityCategory
0,56,Male,173.575262,71.982051,23.891783,4,Normal weight
1,69,Male,164.127306,89.959256,33.395209,2,Obese
998,64,Male,164.192222,57.978115,21.505965,4,Normal weight
999,66,Female,178.53713,74.962164,23.517168,1,Normal weight


If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_02_1_Res2.png)


## Training and Validation

We must evaluate a machine learning model based on its ability to predict values that it has never seen before. Because of this, we often divide the available data into a training set and a validation set. The machine learning model will learn from the training data but ultimately be evaluated based on the validation data.

* **Training Data** - **In Sample Data** - The data that the neural network used to train.
* **Validation Data** - **Out of Sample Data** - The data that the machine learning model is evaluated upon after it is fit to the training data.

There are two effective means of dealing with training and validation data:

* **Training/Validation Split** - The program splits the data according to some ratio between a training and validation (hold-out) set. Typical ratios are 80% training and 20% validation.
* **K-Fold Cross Validation** - The program splits the data into several folds and trains one model per fold, each time holding out a different fold for validation. Because the program creates the same number of models as folds, it can generate out-of-sample predictions for the entire dataset.
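
The bookkeeping behind K-fold cross validation can be sketched with Numpy alone. This is an illustration of how rows are assigned to folds, not a library implementation. The dataset size, fold count, seed, and variable names are all hypothetical, and no model is trained.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed for a repeatable shuffle
n_rows, k = 10, 5                      # hypothetical dataset size and fold count

# Shuffle the row positions, then cut them into k roughly equal folds
shuffled = rng.permutation(n_rows)
folds = np.array_split(shuffled, k)

for i, val_idx in enumerate(folds):
    # Every fold except fold i is used for training
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Fold {i}: train on {len(train_idx)} rows, "
          f"validate on {len(val_idx)} rows")
```

Each row lands in exactly one validation fold, which is why the method yields an out-of-sample prediction for every row in the dataset.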

The code below splits the data into a training and validation set. The training set uses 80% of the data, and the validation set uses 20%. Figure 2.TRN-VAL shows how we train a model on 80% of the data and then validate it against the remaining 20%.

**Training and Validation**
![Training and Validation](https://biologicslab.co/BIO1173/images/class_1_train_val.png "Training and Validation")


### Example 5: Divide Data into Training and Validation Sets

The code in the cell below begins by shuffling the `aqDF` dataset using the Pandas `reindex()` method combined with the Numpy `random.permutation()` function. Once the data has been shuffled, the code splits the data in the Apple Quality dataset into a _Training_ DataFrame called `aqTrainDF` and a _Validation_ DataFrame called `aqValidationDF`.

To do this, the code first creates a _boolean mask_ called `mask` (see the next cell for a discussion of boolean masks) using this line of code:
~~~text
mask = np.random.rand(len(aqDF)) < 0.8
~~~
The code creates a `mask` that is a Numpy array containing the values `True` or `False`. The total number of `True` and `False` values is exactly the same as the number of data points in the complete dataset. Approximately 80% of the values in the `mask` are `True` and the remaining values are `False`.

To figure out the exact number of `True` values in the `mask`, we take advantage of the fact that a `True` is equal to the number `1` while a `False` is equal to the number `0`. Therefore, to compute the number of `True` values, we can simply compute the `sum` for the `mask` as shown by this line of code:
~~~text
print(f"The number of True values in the mask: {sum(mask)}")
~~~

To create the validation DataFrame, the following line of code is used:
~~~text
aqValidationDF = pd.DataFrame(aqDF[~mask])
~~~

The character `~` before the `mask` is called a _tilde_. On Python integers, the `~` operator performs bitwise inversion, changing 0 bits to 1s and 1 bits to 0s. On a Numpy boolean array such as `mask`, it inverts each element, so `~mask` flips every `True` to a `False` and every `False` to a `True`. Since the `mask` is inverted, it now "selects" the roughly 20% of the complete dataset that was **_not_** placed in the Training DataFrame, to be included in the Validation DataFrame.
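
The behavior of `~` and of summing a boolean array can be sketched on a short, hand-written mask (the five values here are made up for illustration):

```python
import numpy as np

# Hypothetical five-element boolean mask
mask = np.array([True, False, True, True, False])

print(~mask)          # element-wise inversion of the boolean array
print(mask.sum())     # True counts as 1, so the sum is the number of True values
print((~mask).sum())  # number of True values in the inverted mask
print(~5)             # on a plain integer, ~ is bitwise inversion
```

The counts from `mask` and `~mask` always add up to the length of the mask, which is why the training and validation sets together contain every row exactly once.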


In [None]:
# Example 5: Split aqDF into a training DF and a validation DF

import pandas as pd
import numpy as np

# Make sure the training DF contains 80% of the data
# And the validation DF contains the remaining 20%

# Usually a good idea to shuffle
aqDF = aqDF.reindex(np.random.permutation(aqDF.index))

# Create a boolean mask the same length as the dataset, with about 80% True values
mask = np.random.rand(len(aqDF)) < 0.80

# Apply the mask to the whole data to produce the training DF
aqTrainDF = pd.DataFrame(aqDF[mask])

# Using the inverse of the mask to generate the validation DF
aqValidationDF = pd.DataFrame(aqDF[~mask])

# Print out the lengths of training and validation DF's
print(f"The original Apple DF: {len(aqDF)}")
print(f"The number of True values in the mask: {sum(mask)}")
print(f"Apple Training DF: {len(aqTrainDF)}")
print(f"Apple Validation DF: {len(aqValidationDF)}")

The original Apple DF: 3709
The number of True values in the mask: 2947
Apple Training DF: 2947
Apple Validation DF: 762


If your code is correct you should see something similar to the following output:
~~~text
The original Apple DF: 3816
The number of True values in the mask: 3050
Apple Training DF: 3050
Apple Validation DF: 766
~~~
Since the generation of the boolean mask is a _random process_, the number of records in the `Training DF` and the `Validation DF` will vary slightly each time the code is run. However, the number of `True` values in the `mask` will _always_ be the same as the length of the `Training DF`, and the sum of the lengths of the `Training DF` and the `Validation DF` will _always_ be equal to the length of the original `Apple DF`.

------------------------------------------------------------
### **MASKS**

A mask in Pandas refers to a boolean series or a boolean array that is used to filter or select specific rows from a DataFrame based on certain conditions. It acts as a filter to determine which rows should be included or excluded in subsequent operations.

The mask is a result of applying a logical condition to a dataframe, resulting in a series or an array where each element holds a boolean value indicating whether the condition is `True` or `False` for that particular row. The `mask` can be used to extract only the rows that satisfy the given condition.

For example, consider a dataframe named `DF` with columns `A` and `B`. To create a mask using a condition (e.g., select rows where column `A` is greater than `5`), you can use the following syntax:
~~~text
mask = DF['A'] > 5
~~~
This will generate a _boolean series_ in mask, where each element represents if the corresponding row in column `A` satisfies the condition (`True`) or not (`False`).

To apply this mask and obtain the filtered dataframe, `filteredDF`, containing only the rows where column `A` is greater than `5`, you can use the following syntax:
~~~text
filteredDF = DF[mask]
~~~
The resulting `filteredDF` will contain only the rows where the condition in the mask is `True`.

Masks are particularly useful for conditional filtering operations and allow for easy extraction of subsets of data based on specific criteria.
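
The two steps above can be run end to end on a small, hypothetical DataFrame (the values in columns `A` and `B` are made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with columns A and B
DF = pd.DataFrame({'A': [2, 7, 5, 9], 'B': ['w', 'x', 'y', 'z']})

mask = DF['A'] > 5        # boolean Series, one True/False per row
filteredDF = DF[mask]     # keeps only the rows where the mask is True

print(mask.tolist())
print(filteredDF)
```

Only the rows where `A` is `7` or `9` survive the filter. The row where `A` equals `5` is excluded because the condition is strictly greater than `5`.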

-----------------------------------------------------


### **Exercise 5: Divide Data into Training and Validation Sets**

In the cell below write the Python code to split the Obesity Prediction data into a training set called `obTrainDF` with 80% of the data and a validation set called `obValidationDF` with the remaining 20%.

Make sure to shuffle your data before you split it. Finally, print out the length of original dataset (i.e. `opDF`), the number of `True` values in the `mask`, as well as the lengths of the training and validation sets.  

In [None]:
# Insert your code for Exercise 5 here



The original Obesity Predicti0n DF: 996
The number of True values in the mask: 807
Obesity Training DF: 807
Validation DF: 189


If your code is correct you should see something similar, but not necessarily identical, to the following output:
~~~text
The original Obesity Predicti0n DF: 996
The number of True values in the mask: 798
Obesity Training DF: 798
Validation DF: 198
~~~
Again, the variability in the output is due to the random process used to generate the boolean `mask`.

## Converting a DataFrame to a Vector (Numpy Array)

There are several reasons why you might want to convert a Pandas DataFrame into a type of **_vector_** known as a Numpy (numeric) array.

For example:

* **Mathematical Operations:** If you need to perform mathematical operations on the data, converting the DataFrame to a numerical array can be advantageous. Numpy arrays allow for efficient numerical computations and provide a wide range of mathematical functions.

* **Integration with Machine Learning Libraries:** Many machine learning libraries, such as Scikit-learn, expect input in the form of numerical arrays. Converting your DataFrame to a numerical array allows you to seamlessly integrate and use these libraries for tasks such as classification, regression, or clustering.

* **Memory Efficiency:** Numpy arrays are more memory-efficient compared to Pandas DataFrames. If you have a large dataset and memory consumption is a concern, converting the DataFrame to a numerical array can help reduce memory usage.

* **Compatibility with Statistical Packages:** Statistical packages like `SciPy` or `Statsmodels` often work more efficiently with numerical arrays. Converting your DataFrame to a numerical array can enhance compatibility and facilitate statistical analysis or hypothesis testing.

* **Neural networks:** Neural networks do not directly operate on Pandas DataFrames.  A neural network requires a numeric vector.  
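
To make the first two points concrete, here is a minimal sketch of a conversion followed by a simple numerical operation. The DataFrame `toyDF` and its values are made up for illustration; `to_numpy()` is the method that the Pandas documentation recommends in place of the `.values` attribute used in the examples below, and both return a Numpy array.

```python
import numpy as np
import pandas as pd

# Small all-numeric DataFrame (made-up values)
toyDF = pd.DataFrame({'Height': [1.5, 1.8, 1.6],
                      'Weight': [50.0, 80.0, 64.0]})

# Convert the DataFrame to a NumPy array
toyX = toyDF.to_numpy()

print(type(toyX))          # <class 'numpy.ndarray'>
print(toyX.shape)          # (3, 2): 3 rows (observations), 2 columns
print(toyX.dtype)          # float64, because every column is numeric

# NumPy functions operate on the whole array at once
print(toyX.mean(axis=0))   # one mean per column
```

Note that the result has two dimensions (rows and columns), so strictly speaking it is a matrix rather than a one-dimensional vector.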

### Example 6: Convert a DataFrame into a Vector

For this example, we are using the complete Apple Quality dataset by re-reading the CSV datafile.

The Python code in the cell below then uses the Pandas `df.values` attribute to convert all the values in a DataFrame into a vector called `aqX`. Note that `values` is an attribute, not a method, so it is written without parentheses.


In [None]:
# Example 6: Convert dataframe into a matrix

import pandas as pd

# Read the datafile
aqDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/apple_quality.csv",
    na_values=['NA','?'])

# Use the Pandas .values attribute (no parentheses)
aqX = aqDF.values

# Print first 4 rows
aqX[0:4]

array([[0, -3.970048523, -2.512336381, 5.346329613, -1.012008712,
        1.844900361, 0.329839797, -0.491590483, 'good'],
       [1, -1.195217191, -2.839256528, 3.664058758, 1.588232309,
        0.853285795, 0.867530082, -0.722809367, 'good'],
       [2, -0.292023862, -1.351281995, -1.738429162, -0.342615928,
        2.838635512, -0.038033328, 2.621636473, 'bad'],
       [3, -0.657195773, -2.271626609, 1.324873847, -0.097874716,
        3.637970491, -3.413761338, 0.790723217, 'good']], dtype=object)

If your code is correct you should see the following output:
~~~text
array([[0, -3.970048523, -2.512336381, 5.346329613, -1.012008712,
        1.844900361, 0.329839797, -0.491590483, 'good'],
       [1, -1.195217191, -2.839256528, 3.664058758, 1.588232309,
        0.853285795, 0.867530082, -0.722809367, 'good'],
       [2, -0.292023862, -1.351281995, -1.738429162, -0.342615928,
        2.838635512, -0.038033328, 2.621636473, 'bad'],
       [3, -0.657195773, -2.271626609, 1.324873847, -0.097874716,
        3.637970491, -3.413761338, 0.790723217, 'good']], dtype=object)
~~~

By inspection of our vector `aqX`, we can see that the first 8 items for each observation (i.e. apple) are numeric, but the last one is a string (word), either the word `good` or `bad`.  The Pandas `df.values` attribute simply creates a Numpy array from a Pandas DataFrame. If we want all the values in our vector to be numeric (which we will), it will be up to us to convert any strings to numbers before we generate our vector.
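
One possible way to do such a conversion is sketched below on a tiny made-up DataFrame. The name `toyDF`, its values, and the choice of coding `good` as 1 and `bad` as 0 are all assumptions made for illustration, not part of the Apple Quality lesson code.

```python
import pandas as pd

# Made-up rows that mimic the layout of the Apple Quality data
toyDF = pd.DataFrame({
    'Size': [-3.97, -1.20, -0.29],
    'Quality': ['good', 'good', 'bad'],
})

# Mixed numeric and string columns give an array of generic Python objects
print(toyDF.values.dtype)      # object

# Replace each string with a number (the 1/0 coding is an arbitrary choice)
toyDF['Quality'] = toyDF['Quality'].map({'good': 1, 'bad': 0})

# Now every column is numeric, so the array has a numeric dtype as well
toyX = toyDF.values
print(toyX.dtype)
print(toyX)
```

The `dtype=object` shown in the Example 6 output is the sign that at least one column was not numeric.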

### **Exercise 6: Convert a DataFrame into a Vector**

In the cell below re-read the Obesity Prediction dataset from the course HTTPS server to re-create your original `opDF` DataFrame.

Then use the Pandas `df.values` attribute to convert the values in the `opDF` DataFrame into a vector called `opX`. Print out the first 4 rows of `opX`.


In [None]:
# Insert your code for Exercise 6 here



array([[56, 'Male', 173.5752624383722, 71.98205082003972,
        23.89178262396797, 4, 'Normal weight'],
       [69, 'Male', 164.1273058223382, 89.95925553264384,
        33.39520945079775, 2, 'Obese'],
       [46, 'Female', 168.0722021276139, 72.93062926527617,
        25.81773745564312, 4, 'Overweight'],
       [32, 'Male', 168.4596328403327, 84.8869124724179,
        29.912246975758787, 3, 'Overweight']], dtype=object)

If your code is correct you should see the following output:
~~~text
array([[52, 'Male', 156.86334691027685, 73.21851105060271,
        29.75623218349644, 4, 'Overweight'],
       [25, 'Male', 170.81100244040022, 58.15110962067611,
        19.930873069088832, 2, 'Normal weight'],
       [77, 'Male', 183.93159969089783, 90.50909209423124,
        26.753432621101897, 4, 'Overweight'],
       [28, 'Female', 159.3443627513497, 46.71433807622408,
        18.39826169919863, 3, 'Underweight']], dtype=object)
~~~

As above, your vector `opX` will contain a mixture of numeric and string values.

## Example 7: Convert a Subset of Columns to a Feature Vector

One operation that you will be asked to do repeatedly in the course will be to create a feature vector containing numeric values from specific columns in a DataFrame. As you will see, there are several ways that you can handle this task.

To make the code easier to understand, the process has been broken down into 3 steps.

### Example 7-Step 1: Create List of Column Names

In the first step, use Python's `list` function in combination with the `df.columns` attribute to create a list of all of the column names in the DataFrame. The name of the list is `aqX_columns`.  

In [None]:
# Example 7-Step 1: Create List of Column Names

# Create list
aqX_columns=list(aqDF.columns)

# Print list
aqX_columns

['A_id',
 'Size',
 'Weight',
 'Sweetness',
 'Crunchiness',
 'Juiciness',
 'Ripeness',
 'Acidity',
 'Quality']

If your code is correct, you should see the following output:
~~~text
['A_id',
 'Size',
 'Weight',
 'Sweetness',
 'Crunchiness',
 'Juiciness',
 'Ripeness',
 'Acidity',
 'Quality']
~~~
This list contains _all_ of the columns in our DataFrame.


### Example 7-Step 2: Remove Unwanted Column Names

Invariably, there will be one or more columns that you will **not** want to include in a particular feature vector. The code in the cell below shows how to use the list `remove()` method to remove a specific column name from the list. Note that `remove()` can only remove one item at a time, so if you need to remove more than one column, each column has to be removed separately.

In [None]:
# Example 7-Step 2: Remove Unwanted Column Names

# Remove each column name separately
aqX_columns.remove('A_id')
aqX_columns.remove('Quality')

# Print list
aqX_columns

['Size',
 'Weight',
 'Sweetness',
 'Crunchiness',
 'Juiciness',
 'Ripeness',
 'Acidity']

If your code is correct, you should see the following output:
~~~text
['Size',
 'Weight',
 'Sweetness',
 'Crunchiness',
 'Juiciness',
 'Ripeness',
 'Acidity']
~~~

Now our list contains only the column names we want to include in creating our feature vector.
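
The introduction to Example 7 mentions that there are several ways to handle this task. As a hedged sketch of two alternatives to calling `remove()` repeatedly, the cell below uses a list comprehension and the Pandas `drop()` method. The DataFrame `toyDF` and the list name `unwanted` are made up for illustration.

```python
import pandas as pd

# Made-up rows with the same kind of columns as the Apple Quality data
toyDF = pd.DataFrame({
    'A_id': [0, 1],
    'Size': [-3.97, -1.20],
    'Weight': [-2.51, -2.84],
    'Quality': ['good', 'bad'],
})

# Column names that should not go into the feature vector
unwanted = ['A_id', 'Quality']

# Option 1: build the list in one pass with a list comprehension
toy_columns = [c for c in toyDF.columns if c not in unwanted]
print(toy_columns)                                  # ['Size', 'Weight']

# Option 2: let Pandas drop the columns and list what is left
print(list(toyDF.drop(columns=unwanted).columns))   # ['Size', 'Weight']
```

Both options leave `toyDF` itself unchanged, since `drop()` returns a new DataFrame by default.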

### Example 7-Step 3: Generate Feature Vector

Finally, we just use our list of column names with the `values` attribute to generate our feature vector called `aqX`.

In [None]:
# Example 7-Step 3: Generate Vector

import pandas as pd

# Generate X feature vector
aqX = aqDF[aqX_columns].values

# Print first 4 values
aqX[0:4]

array([[-3.97004852, -2.51233638,  5.34632961, -1.01200871,  1.84490036,
         0.3298398 , -0.49159048],
       [-1.19521719, -2.83925653,  3.66405876,  1.58823231,  0.8532858 ,
         0.86753008, -0.72280937],
       [-0.29202386, -1.35128199, -1.73842916, -0.34261593,  2.83863551,
        -0.03803333,  2.62163647],
       [-0.65719577, -2.27162661,  1.32487385, -0.09787472,  3.63797049,
        -3.41376134,  0.79072322]])

You should see the following output:
~~~text
array([[-3.97004852, -2.51233638,  5.34632961, -1.01200871,  1.84490036,
         0.3298398 , -0.49159048],
       [-1.19521719, -2.83925653,  3.66405876,  1.58823231,  0.8532858 ,
         0.86753008, -0.72280937],
       [-0.29202386, -1.35128199, -1.73842916, -0.34261593,  2.83863551,
        -0.03803333,  2.62163647],
       [-0.65719577, -2.27162661,  1.32487385, -0.09787472,  3.63797049,
        -3.41376134,  0.79072322]])
~~~

## **Exercise 7: Convert a Subset of Columns to a Feature Vector**

In the next 3 code cells, you are to repeat the steps shown in Example 7 to create a feature vector from your `opDF` DataFrame.

### **Exercise 7-Step 1: Create List of Column Names**

Create a list containing all of the column names in `opDF` and print them out. Call your list `opX_columns`.

In [None]:
# Insert your code for Exercise 7-Step 1 here



['Age',
 'Gender',
 'Height',
 'Weight',
 'BMI',
 'PhysicalActivityLevel',
 'ObesityCategory']

If your code is correct you should see the following output:
~~~text
['Age',
 'Gender',
 'Height',
 'Weight',
 'BMI',
 'PhysicalActivityLevel',
 'ObesityCategory']
~~~

### **Exercise 7-Step 2: Remove Unwanted Column Names**

In the cell below, remove the column names `BMI`, `PhysicalActivityLevel` and `ObesityCategory` from your list. Print out your revised `opX_columns`.

In [None]:
# Insert your code for Exercise 7-Step 2 here



['Age', 'Gender', 'Height', 'Weight']

If your code is correct, you should see the following output:
~~~text
['Age', 'Gender', 'Height', 'Weight']
~~~

If, instead, you see the following error message:

![___](https://biologicslab.co/BIO1173/images/class_04_4_Error.png)

**Don't Panic!**

It probably means that you ran the above code twice. The first time you ran it, the code removed the column names. Since the column names were already gone the second time you ran it, Python could not find them to remove.  

Just go back and re-run **Exercise 7-Step 1** to recreate your `opX_columns` and then re-run Step 2 again.
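
The cell below is a small sketch of why this happens, assuming the error in question is the `ValueError` that Python's `list.remove()` raises when the item is not in the list. The list `toy_columns` is made up for illustration. Checking with `in` before removing makes the code safe to run more than once.

```python
# Made-up list of column names for illustration
toy_columns = ['Age', 'Gender', 'BMI']

# Running this check-then-remove pattern twice is safe: remove() is only
# called when the name is still present in the list
if 'BMI' in toy_columns:
    toy_columns.remove('BMI')
if 'BMI' in toy_columns:
    toy_columns.remove('BMI')

print(toy_columns)    # ['Age', 'Gender']

# Without the check, removing a name that is no longer there raises ValueError
try:
    toy_columns.remove('BMI')
except ValueError as err:
    print("ValueError:", err)
```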

### **Exercise 7-Step 3: Generate Feature Vector**

Use your column list `opX_columns` to generate a feature vector called `opX`. Print out the first four values in `opX`.

In [None]:
# Insert your code for Exercise 7-Step 3 here



array([[56, 'Male', 173.5752624383722, 71.98205082003972],
       [69, 'Male', 164.1273058223382, 89.95925553264384],
       [46, 'Female', 168.0722021276139, 72.93062926527617],
       [32, 'Male', 168.4596328403327, 84.8869124724179]], dtype=object)

If your code is correct you should see the following output:
~~~text
array([[56, 'Male', 173.5752624383722, 71.98205082003972],
       [69, 'Male', 164.1273058223382, 89.95925553264384],
       [46, 'Female', 168.0722021276139, 72.93062926527617],
       [32, 'Male', 168.4596328403327, 84.8869124724179]], dtype=object)
~~~

## Saving a DataFrame to CSV

You might want to convert a Pandas DataFrame into a CSV file in various situations, including:

* **Data Storage:** CSV (Comma-Separated Values) is a commonly used file format for storing tabular data. If you need to store your DataFrame as a standalone file, converting it to a CSV format allows for easy sharing, portability, and compatibility with other software.

* **Data Exchange:** CSV is widely recognized and supported by numerous applications and programming languages. When you want to exchange data with other systems, convert your DataFrame to a CSV file to ensure seamless interoperability and enable the recipients to access and process the data without requiring Pandas or a specific library.

* **Data Analysis:** Some statistical or data analysis software prefer CSV files as input. By converting your DataFrame to a CSV file, you can leverage these external tools or libraries for advanced analysis, visualization, or modeling.

* **Database Import:** Many databases and data storage systems accept CSV files as a means of data insertion. By converting your DataFrame to a CSV file, you can easily import the data into a database, making it more manageable, searchable, and suitable for long-term storage.

* **Data Backup:** Creating a CSV file from your DataFrame acts as a backup option, ensuring that your data remains accessible even if something happens to the original DataFrame or its environment. This can serve as a contingency plan or for version control purposes.

Converting a Pandas DataFrame into a CSV file allows you to save, exchange, analyze, and backup your data efficiently, ensuring flexibility, compatibility, and ease of use across various applications and systems.


## Example 8: Read/Write a DataFrame to a local CSV file

In Example 8, we first write our `aqDF` DataFrame to a file, and then we read it back.

### Example 8A: Write DataFrame to CSV file

The code in the cell below saves (writes) the DataFrame `aqDF` to a CSV file called `AppleQualityCSV.csv`. Here we write the file to our current directory by specifying the path as a dot:
~~~text
# Specify the path
path = "."
~~~
If we wanted to, we could easily write our CSV file to a different location on our computer/laptop by writing out the pathname.

The code then joins the pathname to the filename using an "operating system" (`os.`) command:
~~~text
# Specify the file path and filename
filename_write = os.path.join(path, "AppleQualityCSV.csv")
~~~

As an experiment, we will shuffle the data in our DataFrame before we write it to our local disk drive.


In [None]:
# Example 8A: Write DataFrame to CSV file

import pandas as pd
import numpy as np
import os

# Specify the path
path = "."

# Specify the file path and filename
filename_write = os.path.join(path, "AppleQualityCSV.csv")

# Shuffle the data before saving (np.random needs the numpy import above)
aqDF = aqDF.reindex(np.random.permutation(aqDF.index))

# Specify index=False to not write row numbers
aqDF.to_csv(filename_write, index=False)

You should now see the _new_ file `AppleQualityCSV.csv` in your file browser panel.

### Example 8B: Read DataFrame from a CSV file

The code in the next cell reads the new file `AppleQualityCSV.csv` back into a new DataFrame called simply `df`. This is done to illustrate what data looks like when it is saved to a CSV file.

In [None]:
# Example 8B: Read the new CSV file

import pandas as pd
import os

# Use read_csv() function to read data and create dataframe
df = pd.read_csv("./AppleQualityCSV.csv", na_values=['NA','?'])


# Set the display for 12 rows and 6 columns
pd.set_option('display.max_rows', 12)
pd.set_option('display.max_columns', 6)

# Display 6 columns and 12 rows of the dataframe
display(df)


Unnamed: 0,A_id,Size,Weight,...,Ripeness,Acidity,Quality
0,691,-3.608322,-1.341022,...,3.357795,1.460762,bad
1,1763,2.495227,-2.059803,...,-1.102142,3.368661,good
2,2547,-0.088364,-2.100075,...,1.620981,-0.745927,good
3,3003,-0.692217,-1.029835,...,4.055425,-1.968352,bad
4,1201,2.635308,-2.320476,...,-0.885596,2.965505,good
...,...,...,...,...,...,...,...
3995,3462,-1.600437,1.715898,...,0.283398,-3.473578,good
3996,1049,1.330044,-0.790755,...,-2.079250,1.387408,bad
3997,2866,-0.035350,-0.616466,...,-1.026423,1.020470,bad
3998,784,-1.540001,-0.316826,...,-0.032854,0.351966,good


If your code is correct you should see the following table:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_1_Ex8B.png)

Notice that even though we shuffled the data before writing it to the CSV file in Example 8A, the index numbers at the left are sequential again. The rows are still in shuffled order (compared to the original CSV datafile), but because we wrote the file with `index=False`, the shuffled index labels were never saved, and `read_csv()` assigned new sequential index numbers when the file was read back.

As you will see shortly, this is _not_ what happens when the data is written and then read back into memory using Pickle.
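
A small experiment can make the CSV behavior concrete. The sketch below uses made-up data and writes to a temporary directory so that it does not touch your course files; the names `toyDF`, `backDF` and `toy.csv` are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up data for illustration
toyDF = pd.DataFrame({'x': [10, 20, 30, 40]})

# Shuffle the rows; the index labels travel with their rows
toyDF = toyDF.reindex(np.random.permutation(toyDF.index))

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "toy.csv")

    # index=False: the shuffled index labels are NOT written to the file
    toyDF.to_csv(csv_path, index=False)

    # read_csv() finds no stored index, so it numbers the rows 0, 1, 2, ...
    backDF = pd.read_csv(csv_path)

print(list(toyDF.index))                       # shuffled labels
print(list(backDF.index))                      # [0, 1, 2, 3]
print(list(backDF['x']) == list(toyDF['x']))   # True: same row order as written
```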

## **Exercise 8: Read/Write a DataFrame to a local CSV file**

### **Exercise 8A: Write DataFrame to CSV file**

In the cell below, save the DataFrame `opDF` to a CSV file called `ObesityPredictionCSV.csv`.

In [None]:
# Insert your code for Exercise 8A here



You should now see the file `ObesityPredictionCSV.csv` in your file browser panel. There is no need to read your new CSV file.

## Saving a DataFrame to Pickle

A variety of software programs can use text files stored as CSV. However, they take longer to generate and can sometimes lose small amounts of precision in the conversion. Generally, you will output to CSV because it is very compatible, even outside of Python.

Another file format is [Pickle](https://docs.python.org/3/library/pickle.html). The code below stores the DataFrame in the Pickle file format. Pickle stores data in the **_exact binary representation_** used by Python. The benefit is that no data is lost, as can happen when converting to CSV format. The disadvantage is that generally, only Python programs can read Pickle files.

### Example 9: Saving a DataFrame to Pickle

The code in the cell below saves the DataFrame `aqDF` to a Pickle file called `AppleQualityPickel.pkl`. Before you can either read or save a Pickle file, you need to import the `pickle` module.

The line of code:
~~~text
with open(filename_write,"wb") as fp:
~~~
opens a _file pointer_ `fp`. The argument `"wb"` tells Python that you are going to Write Binary data to the file you just opened.

When the indented block under `with` ends, Python closes the file automatically, which releases system resources and ensures that any data in the buffer is properly written to the disk. The `fp.close()` command that follows in the code below is therefore redundant, although harmless. An explicit `close()` is only required when a file is opened without a `with` statement.
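
How a `with` block treats the file can be checked directly through the file object's `closed` attribute. The following sketch pickles a small made-up dictionary into a temporary directory; the names `data` and `demo.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

# Any picklable object works for this demonstration (made-up data)
data = {'fruit': 'apple', 'count': 3}

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.pkl")

    with open(demo_path, "wb") as fp:
        pickle.dump(data, fp)
        print(fp.closed)    # False: still inside the with block

    print(fp.closed)        # True: the with block has closed the file

    # Calling close() on a file that is already closed does nothing
    fp.close()
```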

In [None]:
# Example 9: Save a DataFrame to pickle

import os
import pickle
import numpy as np

# Specify the path
path = "."

# Specify the file path and filename
filename_write = os.path.join(path, "AppleQualityPickel.pkl")

# Shuffle the data before saving (np.random needs the numpy import above)
aqDF = aqDF.reindex(np.random.permutation(aqDF.index))

# Write out the dataframe to a pickle file
with open(filename_write,"wb") as fp:
    pickle.dump(aqDF, fp)

# Close the file after writing to it
fp.close()

You should now see the file `AppleQualityPickel.pkl` in your file browser panel.

### **Exercise 9: Saving a DataFrame to Pickle**

In the cell below, save your DataFrame `opDF` to a Pickle file called `ObesityPredictionPickel.pkl`. Make sure to shuffle the data before writing to the Pickle file and to close the file after you are done writing to it.

In [None]:
# Insert your code for Exercise 9 here



You should now see the file `ObesityPredictionPickel.pkl` in your file browser panel.

## Loading a Pickle File into Memory

You might want to load a Pickle file into memory in the following scenarios:

* **Preserving Data Structure:** When you have a complex data structure, such as a nested dictionary, list of dictionaries, or custom objects, Pickle allows you to serialize and deserialize the data, preserving its original structure. Loading a Pickle file back into memory ensures that you can restore the data structure exactly as it was when it was saved.

* **Efficient Storage and Retrieval:** Pickle provides a convenient way to store large amounts of data in a compact binary format. If you have large datasets or complex objects that you need to save and retrieve efficiently, using Pickle files can offer significant advantages over other formats like CSV or JSON.

* **Python Object Serialization:** Pickle is a native Python module, specifically designed for serializing Python objects. If you have Python-specific objects or data structures that you want to save and restore without losing any information or functionality, loading a Pickle file allows you to recreate the objects exactly as they were when they were pickled.

* **Data Persistence:** Pickle can be used for persistent storage of data, allowing you to save and load data between program executions. By loading a Pickle file into memory, you can access and utilize the previously saved data, eliminating the need to recreate or recalculate it each time the program runs.

* **Easy Interoperability:** Pickle files can be shared between different Python environments, which makes it easy to transfer data between systems. Loading a Pickle file is a quick and efficient way to import saved data, provided the receiving environment has compatible versions of Python and of the libraries (such as Pandas) that defined the pickled objects.


### Example 10: Load a Pickle file into memory

Loading the pickle file back into memory is accomplished by the following lines of code.  Here we are using Pickle's `load()` function to read the file and create a new DataFrame called `aqPickelDF`.

In [None]:
# Example 10: Load a pickle file into memory

import os
import pickle
import pandas as pd

# Specify the path
path = "."

# Specify the file path and filename
filename_read = os.path.join(path, "AppleQualityPickel.pkl")

# Open up a file pointer fp for reading binary data
with open(filename_read,"rb") as fp:
    aqPickelDF = pickle.load(fp)

# Close the file after reading it
fp.close()

# Set the display for 8 rows and 7 columns
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 7)

# Display the new dataframe
display(aqPickelDF)

Unnamed: 0,Age,Gender,Height,Weight,BMI,PhysicalActivityLevel,ObesityCategory
594,75,Male,166.929051,48.332240,17.344951,3,Underweight
290,49,Male,162.579803,91.178148,34.495114,4,Obese
131,18,Female,169.584299,64.800502,22.532381,3,Normal weight
118,64,Male,167.788027,76.825917,27.288887,1,Overweight
...,...,...,...,...,...,...,...
993,50,Male,177.754455,64.719166,20.482923,4,Normal weight
650,65,Female,172.105700,118.907366,40.143779,2,Obese
723,43,Male,171.108211,62.452657,21.330902,1,Normal weight
952,43,Male,153.739889,77.404803,32.748754,4,Obese


If your code is correct you should see something similar to the following table:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_1_Pickel.png)

Notice that the index numbers at the left side are still jumbled from the previous shuffle.
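
To see on a small scale that Pickle keeps the shuffled index, the sketch below saves and reloads a tiny made-up DataFrame in a temporary directory. The names `toyDF`, `backDF` and `toy.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Made-up data, shuffled in the same way as in the examples above
toyDF = pd.DataFrame({'x': [10, 20, 30, 40]})
toyDF = toyDF.reindex(np.random.permutation(toyDF.index))

with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = os.path.join(tmpdir, "toy.pkl")

    # Write the DataFrame, then read it straight back
    with open(pkl_path, "wb") as fp:
        pickle.dump(toyDF, fp)
    with open(pkl_path, "rb") as fp:
        backDF = pickle.load(fp)

print(list(toyDF.index))      # shuffled index labels
print(list(backDF.index))     # the same labels, in the same order
print(backDF.equals(toyDF))   # True
```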

-----------------------------------

## **Pickle Module**

Python's **_pickle module_** provides functionality for serializing and deserializing Python objects. The `pickle.load()` method is used to deserialize and load a serialized object from a Pickle file or a file-like object.

Here are key points about the `pickle.load()` method:

* **Purpose:** The `pickle.load()` method is used to deserialize and load a Pickle object that was previously serialized with `pickle.dump()` or `pickle.dumps()`.

* **Usage:** To use `pickle.load()`, you first need to open a file in binary mode using the `open()` function and pass the file object as an argument to `pickle.load()`. Alternatively, you can use a file-like object that supports the necessary read operations.

* **Deserialization:** `pickle.load()` reads the serialized object from the file or file-like object and reconstructs the original Python object in memory. It restores the object's state, including its attributes, methods, and other details.

* **Unpickling Security:** It's crucial to note that loading Pickle data can have security implications. Untrusted or malicious Pickle files can execute arbitrary code when loaded using `pickle.load()`. Only load Pickle data from trusted sources to prevent security risks.

* **Handling Different Pickle Formats:** The `pickle.load()` method can handle different Pickle format versions. It detects the Pickle format automatically and loads the object accordingly.

* **Closing the File:** After loading the Pickle object successfully, remember to close the file using the `close()` method of the file object or by utilizing a `with` statement to ensure proper resource management.

Using the `pickle.load()` method provides a convenient way to deserialize and load serialized Python objects. However, be cautious and load Pickle data only from trusted sources to avoid potential security vulnerabilities.
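
The points above about serializing, deserializing and file-like objects can be sketched without writing anything to disk. In the cell below the dictionary `record` is made up for illustration, and the only data being loaded is data the cell itself has just pickled.

```python
import io
import pickle

# Made-up nested structure
record = {'name': 'sample_1', 'values': [1.5, 2.5], 'tags': ('a', 'b')}

# pickle.dumps() serializes to a bytes object instead of a file
blob = pickle.dumps(record)
print(type(blob))                      # <class 'bytes'>

# pickle.loads() rebuilds the object from those bytes
print(pickle.loads(blob) == record)    # True

# pickle.load() also accepts a file-like object such as io.BytesIO
buffer = io.BytesIO(blob)
restored = pickle.load(buffer)
print(restored == record)              # True
print(type(restored['tags']))          # <class 'tuple'>: the tuple stays a tuple
```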

------------------------------------

### **Exercise 10: Load a Pickle file into memory**

In the cell below, load the pickle file `ObesityPredictionPickel.pkl` back into memory using Pickle's `load()` function, creating a DataFrame called `opPickelDF`. Display 7 columns and 8 rows of `opPickelDF`.

In [None]:
# Insert your code for Exercise 10 here



Unnamed: 0,Age,Gender,Height,Weight,BMI,PhysicalActivityLevel,ObesityCategory
594,75,Male,166.929051,48.332240,17.344951,3,Underweight
290,49,Male,162.579803,91.178148,34.495114,4,Obese
131,18,Female,169.584299,64.800502,22.532381,3,Normal weight
118,64,Male,167.788027,76.825917,27.288887,1,Overweight
...,...,...,...,...,...,...,...
993,50,Male,177.754455,64.719166,20.482923,4,Normal weight
650,65,Female,172.105700,118.907366,40.143779,2,Obese
723,43,Male,171.108211,62.452657,21.330902,1,Normal weight
952,43,Male,153.739889,77.404803,32.748754,4,Obese


If your code is correct you should see something similar to the following output:

![___](https://biologicslab.co/BIO1173/images/class_02_1_Pickel1.png)

Notice that the index numbers at the left side are still jumbled from the previous shuffle.

## **Lesson Turn-in**

When you have completed all of the code cells, and run them in sequential order (the last code cell should be number 28), use the **File --> Print.. --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_02_1.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.