<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/SP25_Assigment_01_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Assignment_01: Neural Network Analysis of Tabular Datasets**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)



# **The Purpose of Assignments**

In this course, **Assignments** are designed to help me (and you) assess your ability to transfer knowledge gained in completing class coding exercises to solving more realistic problems.

Assignments play a pivotal role in reinforcing your learning, as they require you to apply theoretical concepts to practical scenarios. This helps solidify your understanding and enhances your problem-solving skills. By tackling these assignments independently, you develop critical thinking and the ability to synthesize information from various sources. Moreover, assignments encourage you to explore topics more deeply, fostering intellectual curiosity and promoting a deeper engagement with the subject matter. Ultimately, these assignments are not just a measure of your learning, but a means to equip you with the skills needed for real-world applications and future challenges.

## **MAKE A COPY OF THIS NOTEBOOK!**

For your assignment to be graded, you **must** make a copy of this Colab notebook in your GDrive and you **must** use this copy as your worksheet.

## Google CoLab Instructions

You MUST run the following code cell to get credit for this assignment. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this assignment.

In [None]:
# YOU MUST RUN THIS CELL FIRST

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this assignment.")
    COLAB = False

Your GMAIL address **must** appear in the output in order for your work to be graded.

### Define functions

The cell below creates one (or more) functions that are needed for this assignment. If you don't run this cell, you will receive errors later when you try to run some cells.

In [None]:
# Create functions for this lesson

def list_float_columns(dataframe):
    """
    Create a list of all columns in a DataFrame that contain float values.

    Parameters:
    dataframe (pd.DataFrame): The DataFrame to check.

    Returns:
    list: A list of column names that contain float values.
    """
    float_columns = [col for col in dataframe.columns if dataframe[col].dtype == 'float64']
    return float_columns

# **Assignment 1: Regression**

**Assignment_01** is specifically designed to assess your ability to write the Python/Tensorflow/Keras code necessary to build neural networks that can perform binary classification, multiclass classification or regression on tabular data. Based on the 1st digit in your myUTSA ID ('abc123'), you have been assigned to perform **regression**.

Unlike your class lessons, you will **not** be given examples that you can simply copy and paste. Rather, you will be given a problem to solve and it will be up to you to use code snippets that you have been given previously to solve different aspects of this assignment. And unlike your class lessons, you will **not** be given the correct output. In other words, this assignment is basically how you would solve an actual biomedical problem.


# **Regression by Neural Networks**

**Regression** is a type of supervised learning used for predicting a continuous target variable. Unlike classification, which predicts discrete labels (e.g., cat vs. dog), regression models aim to predict a continuous outcome (e.g., house prices, stock prices, or temperature).

### **Performing Regression with Tabular Data Using Neural Networks:**
Here’s a step-by-step guide on how to perform regression using neural networks:

### **Data Preparation:**

- **Collect Data:** Obtain a dataset with numerical features and a continuous target variable.

- **Clean Data:** Handle missing values, outliers, and erroneous entries.

- **Data Normalization:** Normalize your data (e.g. convert to Z-scores) to help the neural network learn more efficiently.

- **Data Pre-Processing:** Create X- and Y-feature vectors.

- **Split Data:** Divide your data into training and test sets.

### **Neural Network Model**

- **Build the Neural Network Model:** Use TensorFlow and Keras to define the neural network architecture.

- **Train the Model:** Fit the model to your training data, using the validation set to monitor performance.

- **Evaluate the Model:** Assess the model’s performance on the test set.

# **Your Dataset for Assignment_01**

The **_first_**  digit in your myUTSA ID (e.g. "abc123") will determine which dataset you are to analyze for this assignment and which type of neural network (i.e. classification or regression) you will need to construct. For example, if your myUTSA ID was **vue682**, then your first digit is the number `6`.

**---WARNING------WARNING------WARNING------WARNING------WARNING------WARNING---**

You are **not** free to choose any dataset for this assignment. If you analyze the wrong dataset, your assignment will **NOT BE GRADED**. If you are uncertain which dataset you should be working on, contact your Instructor for help. Remember, your score in this assignment will have a large impact on your course grade so please be careful.


| First Digit myUTSA ID | Dataset to Analyze     | Neural Network Type       |
|-----------------------|------------------------|---------------------------|
| 0                     | Hepatitis              | Binary Classification     |
| 1                     | Coimbra Breast Cancer  | Binary Classification     |
| 2                     | Parkinson Speech       | Binary Classification     |
| 3                     | Indian Liver           | Binary Classification     |
| 4                     | Thyroid Replacement    | Multiclass Classification |
| 5                     | Wine Quality           | Multiclass Classification |
| 6                     | Liver Disease          | Multiclass Classification |
| 7                     | Bone Marrow Transplant | Regression                |
| 8                     | German Breast Cancer   | Regression                |
| 9                     | Diabetes Progression   | Regression                |

## **Descriptions of Data Sets for Regression**

This section describes the various datasets, information for downloading them, and what variable(s) your network should predict. Remember, you do **not** earn any credit if you analyze the wrong dataset. Pay particular attention to the **variable for Regression** for your assigned dataset. You will need to know the name of this feature when you are constructing your `X-` and `Y-feature` vectors.

---------------------------------

## **Bone Marrow Transplant - 1st myUTSA Digit = 7**

#### **Filename:** `bone_marrow_transplant.csv`
#### **Response Variable for Regression (Y):** `survival_time`

**Description:** The dataset describes pediatric patients with several hematologic diseases: malignant disorders (e.g. patients with acute lymphoblastic leukemia, with acute myelogenous leukemia, with chronic myelogenous leukemia, with myelodysplastic syndrome) and nonmalignant cases (i.a. patients with severe aplastic anemia, with Fanconi anemia, with X-linked adrenoleukodystrophy).

All patients were subject to the unmanipulated allogeneic unrelated donor hematopoietic stem cell transplantation.

The motivation of this study was to identify the most important factors influencing the success or failure of the transplantation procedure. In particular, the study sought to verify the research hypothesis that an increased dosage of CD34+ cells / kg extends overall **survival time** without the simultaneous occurrence of undesirable events affecting patients' quality of life.

* **Instances:** The set contains 187 examples characterized by 37 attributes.
* **Source:** UCI Machine Learning Repository

The meaning of the following features is as follows:
- **Recipientgender:** Male, Female
- **Stemcellsource:** Source of hematopoietic stem cells (Peripheral blood - 1, Bone marrow - 0)
- **Donorage:** Age of the donor at the time of hematopoietic stem cells apheresis
- **Donorage35:** Donor age <35 - 0, Donor age >=35 - 1
- **IIIV:** Development of acute graft versus host disease stage II or III or IV (Yes - 1, No - 0)
- **Gendermatch:** Compatibility of the donor and recipient according to their gender (Female to Male - 1, Other - 0)
- **DonorABO:** ABO blood group of the donor of hematopoietic stem cells (0 - 0, 1, A, B=-1, AB=2)
- **RecipientABO:** ABO blood group of the recipient of hematopoietic stem cells (0 - 0, 1, A, B=-1, AB=2)
- **RecipientRh:** Presence of the Rh factor on recipient's red blood cells ('+' - 1, '-' - 0)
- **ABOMatch:** Compatibility of the donor and the recipient of hematopoietic stem cells according to ABO blood group (matched - 1, mismatched - 1)
- **CMVstatus:** Serological compatibility of the donor and the recipient of hematopoietic stem cells according to cytomegalovirus infection prior to transplantation (the higher the value the lower the compatibility)
- **RecipientCMV:** Presence of cytomegalovirus infection in the donor of hematopoietic stem cells prior to transplantation (presence - 1, absence - 0)
- **Disease:** Type of disease (ALL, AML, chronic, nonmalignant, lymphoma)
- **Riskgroup:** High risk - 1, Low risk - 0
- **Txpostrelapse:** The second bone marrow transplantation after relapse (No - 0; Yes - 1)
- **Diseasegroup:** Type of disease (malignant - 1, nonmalignant - 0)
- **HLAmatch:** Compatibility of antigens of the main histocompatibility complex of the donor and the recipient of hematopoietic stem cells according to ALL international BFM SCT 2008 criteria
- **HLAmismatch:** HLA matched - 0, HLA mismatched - 1
- **Antigen:** Number of antigens that differ between the donor and the recipient (-1 - no differences, 0 - one difference, 1 (2) - two (three) differences)
- **Allel:** Number of alleles that differ between the donor and the recipient (-1 - no differences, 0 - one difference, 1 (2) (3) - two (three, four) differences)
- **HLAgrI:** The difference type between the donor and the recipient (HLA matched - 0, the difference is in only one antigen - 1, the difference is only in one allele - 2, the difference is only in DRB1 cell - 3, two differences (two alleles or two antigens) - 4, two differences (two alleles or two antigens) - 5)
- **Recipientage:** Age of the recipient of hematopoietic stem cells at the time of transplantation
- **Recipientage10:** Recipient age <10 - 0, Recipient age >=10 - 1
- **Recipientageint:** Recipient age in (0, 5] - 0, (5, 10] - 1, (10, 20] - 2
- **Relapse:** Reoccurrence of the disease (No - 0, Yes - 1)
- **aGvHDIIIIV:** Development of acute graft versus host disease stage III or IV (Yes - 0, No - 1)
- **extcGvHD:** Development of extensive chronic graft versus host disease (Yes - 0, No - 1)
- **CD34kgx10d6:** CD34+ cell dose per kg of recipient body weight (10^6/kg)
- **CD3dCD34:** CD3+ cell to CD34+ cell ratio
- **CD3dkgx10d8:** CD3+ cell dose per kg of recipient body weight (10^8/kg)
- **Rbodymass:** Body mass of the recipient of hematopoietic stem cells at the time of transplantation
- **ANCrecovery:** Time to neutrophils recovery defined as neutrophils count >0.5 x 10^9/L
- **PLTrecovery:**  Time to platelet recovery defined as platelet count >50000/mm3
- **time_to_aGvHD_III_IV:** Time to development of acute graft versus host disease stage III or IV
- **survival_time:** Survival time in days.
- **survival_status:** Survived (Yes - 0, No - 1)

------------------------------------------

## **Breast Cancer - 1st myUTSA Digit = 8**

#### **Filename:** `breastCancer.csv`
#### **Response Variable for Regression (Y):** `time`


**Description:** The German Breast Cancer Study Group (GBSG2) dataset studies the effects of hormone treatment on recurrence-free survival time. The event of interest is the recurrence of cancer. This data frame contains the observations of 686 women.

- **horTh:** hormonal therapy, a factor at two levels (yes 1 and no 2).
- **age:** age of the patients in years.
- **menostat:** menopausal status, a factor at two levels `pre` (premenopausal) and `post` (postmenopausal).
- **tsize:** tumor size (in mm).
- **tgrade:** tumor grade, an ordered factor at levels 1 < 2 < 3.
- **pnodes:** number of positive nodes.
- **progrec:** progesterone receptor (in fmol).
- **estrec:** estrogen receptor (in fmol).
- **time:** recurrence-free survival time (in days).
- **cens:** censoring indicator (0 - censored, 1 - event).

-----------------------------------------

## **Diabetes Progression Dataset - 1st myUTSA Digit = 9**

#### **Filename:** `diabetes_progression.csv`
#### **Response Variable for Regression (Y):** `disease_progression`

**Diabetes** is more than just a high blood sugar condition; it’s a complex, chronic illness that, if not managed properly, can lead to severe health complications like heart disease, nerve damage, and kidney failure. Predicting how the disease will progress can support better medical decision-making and patient care.

**Dataset Overview**

The Diabetes dataset, a mainstay in regression analysis, includes ten baseline variables such as age, sex, BMI, average blood pressure, and six blood serum measurements for 442 diabetes patients. The target variable is a quantitative measure of disease progression one year after the baseline.

**Data Characteristics:**
- **Number of Instances:** 442
- **Number of Attributes:** 10 numeric predictive values
- **Target:** Quantitative measure of disease progression

**Attribute Information:**
- **age:** Age in years
- **sex:** Gender of the patient
- **bmi:** Body mass index
- **bp:** Average blood pressure
- **serum_chol:** Total serum cholesterol (mg/dL)
- **ldl:** Low-density lipoproteins (mg/dL)
- **hdl:** High-density lipoproteins (mg/dL)
- **chol_hdl_ratio:** Total cholesterol / HDL
- **log_trigly:** Log of serum triglycerides level
- **blood_glu:** Blood sugar level (mg/dL)
- **disease_progression:** Disease progression one year after baseline

-----------------------------------

# **General Instructions**

To make the assignment more manageable, you will be given a number of specific steps to perform. For each step you will be given a specific example in a particular class lesson that you can use for reference. For example, in **Step 1: Download and Extract Data** you are given **REF: Class_01_6 (Example 1)**. That means Example 1 in Class_01_6 provides similar code that you could use to complete that step of this assignment.

### **Variable Names**

In writing your code for this assignment, you are free to give your variables any name that makes sense to you. This includes the name of the DataFrame that holds your data. If you copy-and-paste code from earlier Class assignments, you always have to edit the name of the DataFrame to match the name you select for this assignment.

When it has been necessary to give an example name for a DataFrame in an instruction, the DataFrame has been called `dataFrameDF`. You will need to edit the name `dataFrameDF` to match the actual name you have given to your DataFrame.

### **Can I Use AI?**

You are free to use AI (e.g. Microsoft Co-Pilot) to help you complete your assignment---but you need to be very careful.

While AI can be very helpful in correcting coding errors, it can also give you code that is totally incorrect for this assignment. A small number of students in previous classes have flunked their assignment by using AI code that did not generate the correct output.

If you aren't sure what you are doing, it's much, much safer to get help with any of your coding problems from your course instructor and/or course TAs.

## **Step 1: Download and Extract Data**

**REF: Class_01_6 (Example 1)**

As usual, a coding project starts with downloading a dataset. Since your dataset is in tabular form, you should use Pandas to read the datafile and store the information in a DataFrame. You are free to choose the name for all of your variables in this assignment, including the name of your DataFrame.

In the cell below, write the code to download your datafile from the course server and create a Pandas DataFrame to store your data. Use the function `display()` to show the data in 8 rows. For full credit, you need to show **all** of the columns in your DataFrame. You can do this by using the following code:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', 8)
~~~

Remember, you will need to edit the name `dataFrameDF` to match the name you select for your DataFrame.
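
As a rough sketch of the pattern described above, the snippet below reads a small in-memory CSV with `pd.read_csv()` and then applies the two display options. The CSV text, the column names, and the name `exampleDF` are all made up for illustration; your own code would read your assigned datafile instead, and in a notebook you would normally call `display()` rather than `print()`.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded datafile
csv_text = """age,bmi,outcome
34,22.5,110.0
51,30.1,145.5
47,27.8,98.2
"""

# Read the CSV into a DataFrame (pd.read_csv also accepts a URL or a file path)
exampleDF = pd.read_csv(io.StringIO(csv_text))

# Show every column, but cap the number of displayed rows at 8
pd.set_option('display.max_columns', exampleDF.shape[1])
pd.set_option('display.max_rows', 8)

print(exampleDF)
```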


In [None]:
# Step 1: Download and Extract Data



If your code is correct you should see a table with a relatively large number of columns that may extend beyond the right edge of your notebook.

## **Step 2: Describe DataFrame**

**REF: Class_01_6 (Example 3)**

The `df.describe()` command in Python is used with pandas DataFrames. It provides a summary of statistics for each column in the DataFrame. By default, it will return the count, mean, standard deviation, min, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and max values for numerical columns. It can be a handy tool for getting a quick overview of your dataset!

Use the `df.describe()` command to summarize the data in your DataFrame. Make sure to replace the `df` with the actual name of your DataFrame.

Again use these commands to set the display options:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', 8)
~~~
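
To illustrate what `describe()` returns, here is a small sketch on a made-up DataFrame (the name `exampleDF` and its columns are hypothetical). The result is itself a DataFrame whose eight rows are the summary statistics listed above.

```python
import pandas as pd

# Hypothetical numeric data used only for illustration
exampleDF = pd.DataFrame({
    "age": [34, 51, 47, 29],
    "bmi": [22.5, 30.1, 27.8, 24.0],
})

# Summary statistics: count, mean, std, min, 25%, 50%, 75%, max
summary = exampleDF.describe()
print(summary)
```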


In [None]:
# Step 2: Describe DataFrame



If your code is correct you should see a table with a relatively large number of columns that may extend beyond the right edge of your notebook and 8 rows containing the summary statistics for each column.

## **Step 3: Find Missing Values**

**REF: Class_01_6 (Example 4)**

In **Biostatistics**, finding and replacing missing values is crucial for several reasons:

1. **Preserving Data Integrity**: Missing values can distort the analysis, leading to biased results. By addressing missing values, you ensure that your conclusions are based on complete and accurate data.

2. **Statistical Validity**: Many statistical tests and models require complete data. Missing values can reduce the statistical power of these tests, making it difficult to detect real effects.

3. **Avoiding Data Loss**: Simply discarding rows or columns with missing values can lead to a significant loss of data, especially if the dataset is already small. Imputing missing values helps retain as much data as possible.

4. **Model Accuracy**: Machine learning models can be sensitive to missing data. Handling missing values appropriately can improve the performance and accuracy of predictive models.

5. **Consistency**: Different columns in a dataset may have varying levels of missing data. Addressing these inconsistencies helps in creating a more uniform dataset, which is easier to analyze and interpret.

The `df.isnull()` command in pandas is used to detect missing values in a DataFrame. It returns a DataFrame of the same shape as the original, but with Boolean values: `True` where the value is missing (`NaN`) and `False` where the value is not missing.

In the cell below, use the command `df.isnull()` to find and print out any missing values in your DataFrame.

To make sure you see all of the values, use this code to set your display output:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', dataFrameDF.shape[0])

~~~
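
The behavior of `isnull()` can be seen on a tiny made-up DataFrame (the name `exampleDF` and its values are hypothetical). The per-column count at the end is an optional extra, not something this step requires.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values in the "bmi" column
exampleDF = pd.DataFrame({
    "age": [34, 51, 47, 29],
    "bmi": [22.5, np.nan, 27.8, np.nan],
})

# Boolean DataFrame of the same shape: True marks a missing value
missing_mask = exampleDF.isnull()
print(missing_mask)

# Optional condensed view: number of missing values in each column
missing_counts = missing_mask.sum()
print(missing_counts)
```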

In [None]:
# Step 3: Find Missing Values



## **Step 4: Replace Missing Values**

**REF: Class_01_6 (Example 5)**

One common strategy for replacing missing values is to use the `median` of the column to replace the missing value. The `median` is used instead of the column `mean` since the `median` is a robust measure of central tendency and is less affected by outliers compared to the `mean`.

To do this:
1. Calculate the median of the column: Use the `median()` function to find the median value of the column.

2. Replace the missing values: Use the `fillna()` function to replace the missing values with the median.

If your DataFrame had one (or more) columns with missing values, you will need to write the Python code to replace these missing values with the column's `median` value in the cell below.

After you have replaced the missing values, again use the `df.isnull()` command to print out the names of columns with missing values to make sure all of the missing values have been taken care of. (Just use the same code you wrote in **Step 3**.)
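
The two-part recipe above (compute the median, then fill) might look like the following on a made-up column. The DataFrame name `exampleDF` and the column `bmi` are hypothetical; your dataset may have different columns with missing values, or none at all.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
exampleDF = pd.DataFrame({"bmi": [22.5, np.nan, 27.8, 24.0, np.nan]})

# 1. Median of the column (missing values are skipped by default)
bmi_median = exampleDF["bmi"].median()

# 2. Replace the missing values with that median
exampleDF["bmi"] = exampleDF["bmi"].fillna(bmi_median)

# Confirm that no missing values remain
print(exampleDF.isnull().sum())
```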




In [None]:
# Step 4: Replace missing values



## **Step 5: Display Non-numeric Categories**

**REF: Class_02_2 (Example 3 - Step 1)**

When building neural networks it is especially important to know which columns contain non-numeric data ("strings").

In the cell below, write the code to print out a list of columns in your DataFrame that contain non-numeric data.
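
One possible way to find such columns is pandas' `select_dtypes()`, sketched here on made-up data. This is only one approach and may differ from the referenced class example; the name `exampleDF` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data with one string column
exampleDF = pd.DataFrame({
    "age": [34, 51, 47],
    "bmi": [22.5, 30.1, 27.8],
    "status": ["pre", "post", "post"],
})

# Keep only the columns whose dtype is not numeric, then list their names
non_numeric_cols = exampleDF.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```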

In [None]:
# Step 5: Display non-numeric categories



## **Step 6: Print Names in a Categorical Column**

**REF: Class_02_2 (Example 3 - Step 3)**

In the cell below, write the code to print out a list of categorical values (strings) that are in the non-numeric column found in **Step 5**.

In [None]:
# Step 6: Print names in a categorical column



## **Step 7: Map Strings to Integer Values**

**REF: Class_02_2 (Example 3 - Step 2)**

In the cell below, write the code to map each string shown in the output above to a different integer value. To make sure your mapping worked as intended, use the `display(df)` function to display your updated DataFrame.
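
A minimal sketch of listing the strings in a column and then mapping them to integers is shown below. The column `status`, its values, and the particular integer codes are assumptions made for illustration; the strings in your dataset will differ.

```python
import pandas as pd

# Hypothetical data with one categorical (string) column
exampleDF = pd.DataFrame({
    "age": [34, 51, 47, 29],
    "status": ["pre", "post", "post", "pre"],
})

# List the distinct strings found in the categorical column
print(exampleDF["status"].unique())

# Map each string to an integer code of your choosing
mapping = {"pre": 0, "post": 1}
exampleDF["status"] = exampleDF["status"].map(mapping)

print(exampleDF)
```

Any string that is missing from the mapping dictionary becomes `NaN` after `map()`, so it is worth checking that every string printed by `unique()` has an entry.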

In [None]:
# Step 7: Map Strings to Integer Values



If your code is correct your output should be the same as Step 1 except that the strings in the non-numeric column should now be integers.

## **Step 8: Shuffle and Reindex your DataFrame**

**REF: Class_02_3 (Example 2)**

**Shuffling and reindexing** the data are important steps when building neural networks for several reasons:

1. **Preventing Overfitting**: When the data is in a specific order, the model might learn patterns that are a result of the order rather than the actual data. Shuffling helps to prevent the model from overfitting to these spurious patterns.

2. **Improving Generalization**: By shuffling the data, you ensure that the model is exposed to a wide variety of examples during training, which helps it to generalize better to new, unseen data.

3. **Reducing Bias**: Shuffling ensures that the distribution of data is more uniform across the training batches. This reduces the risk of introducing bias, especially if the data has some inherent order that might otherwise affect the model's performance.

4. **Enhancing Convergence**: Neural networks often train faster and more reliably when the data is shuffled. This is because the model updates its parameters in a more representative and balanced manner.

5. **Ensuring Robustness**: By reindexing the data, you avoid potential issues that could arise from relying on the original indices, which might have some underlying structure or grouping that could affect the training process.


In the cell below, write the code to shuffle and reindex your DataFrame. Then display your shuffled DataFrame using the following display settings:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', 8)
~~~
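
One common way to shuffle a DataFrame is to reindex it with a random permutation of its own index, sketched below on made-up data. Whether the referenced class example uses exactly this call is an assumption; `exampleDF`, `shuffledDF`, and the seed value are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for illustration
exampleDF = pd.DataFrame({
    "age": [34, 51, 47, 29, 62, 40],
    "bmi": [22.5, 30.1, 27.8, 24.0, 26.3, 21.9],
})

# Fix the seed so the shuffle is repeatable (the value 42 is arbitrary)
np.random.seed(42)

# Reorder the rows by a random permutation of the existing index labels
shuffledDF = exampleDF.reindex(np.random.permutation(exampleDF.index))

print(shuffledDF)
```

Each row keeps its original index label, which is why the index column appears out of order after the shuffle.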

In [None]:
# Step 8: Shuffle and Reindex your DataFrame



If your code is correct, the output generated above by **Step 8** should look almost identical to the output you got after running **Step 7** except that the index numbers shown in the leftmost column should be in a random order.

## **Step 9: Normalize Data to the Z-values**

**REF: Class_02_2 (Example 2)**

When building neural networks, it is important to normalize the data using Z-score normalization for several reasons:

1. **Standardization**: Z-score normalization standardizes the features to have a mean of 0 and a standard deviation of 1. This ensures that all features contribute equally to the model and prevents features with larger scales from dominating the learning process.

2. **Improved Convergence**: Neural networks often use gradient-based optimization algorithms. Normalized data can lead to more stable and faster convergence of these algorithms because the gradients are more balanced and less likely to explode or vanish.

3. **Enhanced Performance**: Normalized data often leads to better model performance. By bringing all features to a similar scale, the model can learn more effectively and produce more accurate predictions.

4. **Handling Outliers**: Z-score normalization reduces the impact of outliers by transforming the data to a standard scale. Outliers will have high positive or negative Z-scores, but their influence will be mitigated compared to unnormalized data.

5. **Consistency**: Normalization ensures that different features are on the same scale, which is particularly important when combining multiple datasets or using features with different units of measurement.

To make things easier, only normalize the `float` data in your dataset by converting these numbers into their Z-values.

In the cell below, use the function `list_float_columns()` to find the names of all of the columns in your DataFrame that contain float numbers. (NOTE: This is a custom function that was created near the beginning of the assignment.) Once you know the column names, you can use the `zscore()` function to normalize the values only in the columns containing floats. Print out the names of the columns containing floats, and the first five Z-scores in each of these columns.
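
The idea can be sketched on made-up data as follows. To keep the snippet self-contained it repeats a compact copy of `list_float_columns()`; in the notebook you would use the version defined earlier. The name `exampleDF` and its columns are hypothetical.

```python
import pandas as pd
from scipy.stats import zscore

def list_float_columns(dataframe):
    """Compact copy of the helper defined near the top of this notebook."""
    return [col for col in dataframe.columns if dataframe[col].dtype == 'float64']

# Hypothetical data: one integer column and one float column
exampleDF = pd.DataFrame({
    "age": [34, 51, 47, 29],            # integers: left untouched
    "bmi": [22.5, 30.1, 27.8, 24.0],    # floats: converted to Z-scores
})

float_columns = list_float_columns(exampleDF)
print(f"Columns with float values: {float_columns}")

# Replace each float column with its Z-scores and show the first five values
for col in float_columns:
    exampleDF[col] = zscore(exampleDF[col].to_numpy())
    print(col, exampleDF[col].head(5).tolist())
```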

In [None]:
# Step 9: Normalize Data to the Z-values



**HINT:** If you see this error:

~~~text
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-9b058d8fbfea> in <cell line: 0>()
      4 from scipy.stats import zscore
      5
----> 6 float_columns = list_float_columns(dataFrameDF)
      7 print(f"Columns with float values: {float_columns}")
      8

NameError: name 'list_float_columns' is not defined
~~~

it means you didn't run the cell called `Define functions` at the beginning of this assignment.


## **Step 10: Pre-process Data for Neural Network Training**

**REF: Class_04_3 (Example 3)**

In the cell below, write the code to preprocess the data in your DataFrame to make ready to feed into your neural network.

NOTES:

1. Since you have already converted your float values into their Z-scores, you should **not** normalize any data during your pre-processing. In other words, converting Z-scores into Z-scores, a second time, is not a good thing.

2. Basically all you need to do is write the code to generate your `X-feature vector` and your `Y-feature vector`. Your `Y-values` will be in the column called `survival_time`, `time` or `disease_progression`, depending upon which dataset you are analyzing.

3. When generating your `X-feature vector`, you should use _all_ of the columns in your DataFrame **EXCEPT** for the column containing the `Y-values`.

4. Since you will be building a **Regression** neural network, **do not** one-hot encode the Y-values when generating your `Y-feature vector`. Instead, just use the values in column containing the Y-values.

5. Do **not** split your data into training and test sets yet. You will do the split later.

6. When you are done generating both your `X-` and `Y-` feature vectors, print out the first 4 values in each vector.
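
The notes above can be sketched on made-up data like this. The response column name `outcome`, the variable names, and the cast to `float32` are assumptions for illustration; substitute the response variable listed for your assigned dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "outcome" stands in for your response variable
exampleDF = pd.DataFrame({
    "age": [34, 51, 47, 29, 62],
    "bmi": [-0.9, 1.2, 0.5, -0.4, 0.1],
    "outcome": [110.0, 145.5, 98.2, 120.3, 133.0],
})

target_col = "outcome"

# X: every column except the response column
x_columns = exampleDF.columns.drop(target_col)
X = exampleDF[x_columns].values.astype(np.float32)

# Y: the raw response values (no one-hot encoding for regression)
Y = exampleDF[target_col].values.astype(np.float32)

# First 4 values of each vector
print(X[:4])
print(Y[:4])
```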

In [None]:
# Step 10 - Preprocess Data for Neural Network Training



## **Step 11:  Construct and Compile Neural Network**

**REF: Class_04_2 (Example 3)**

In the cell below, use the Keras/Tensorflow libraries to split your data into `test` and `train` splits, making the test size = 0.25, and set the random state to `42`.

Then construct and compile a regression neural network with 3 hidden layers but do **not** start your training in this step. You can use the code in Example 3 in Class_04_2 as a template for your model. Since this neural network will perform regression, there should only be `1` neuron in the output layer.

After you construct your neural network, compile it but do **not** start training ("fitting") it yet.
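
The data split in this step can be sketched with scikit-learn's `train_test_split()`, which accepts `test_size` and `random_state` arguments. Using scikit-learn here is an assumption, since the referenced class example is not reproduced in this notebook; the arrays below are made up, and the model construction itself is left to you.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up feature matrix (8 samples, 2 features) and continuous targets
X = np.arange(16, dtype=np.float32).reshape(8, 2)
Y = np.linspace(100.0, 170.0, 8).astype(np.float32)

# Hold out 25% of the samples for testing; random_state makes it repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```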

In [None]:
# Step 11: Construct and Compile Neural Network




If your code is correct, you should **not** see any output after running the previous cell.

## **Step 12: Print Summary of Your Model**

**REF: Class_04_2**

The `model.summary()` command in deep learning frameworks like Keras and TensorFlow provides a detailed summary of the neural network model. This summary includes useful information about the model's architecture, including:

1. **Layer Names and Types:** The name and type (e.g., Dense, Conv2D, LSTM) of each layer in the model.

2. **Output Shape:** The shape of the output produced by each layer.

3. **Number of Parameters:** The total number of trainable and non-trainable parameters in each layer. This includes both the weights and biases.

4. **Model Parameters Summary:** A total count of all trainable and non-trainable parameters in the model.

In the cell below, use the `model.summary()` command to print out the information about your neural network.
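
To help you read the parameter counts in the summary, the short calculation below shows where they come from for fully connected (Dense) layers: each neuron has one weight per input plus one bias. The layer sizes used here are hypothetical and are not a recommended architecture.

```python
# Hypothetical architecture: 10 input features, three hidden layers, 1 output
n_inputs = 10
layer_units = [25, 10, 5, 1]

total = 0
previous = n_inputs
for units in layer_units:
    # weights (previous * units) plus one bias per neuron (units)
    params = previous * units + units
    print(f"Dense layer with {units:>2} units: {params} parameters")
    total += params
    previous = units

print(f"Total parameters: {total}")
```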



In [None]:
# Step 12: Print Summary of Your Model



## **Step 13: Create Early Stopping Monitor**

**REF: Class_04_2 (Example 9)**

An **Early Stopping Monitor** is a technique used during the training of neural networks to prevent overfitting and improve the model's generalization to new, unseen data. It works by monitoring the performance of the model on a validation dataset and stopping the training process when the performance starts to degrade.

#### Here’s how it works:

1. **Monitoring Performance**: Early stopping keeps track of a specific metric, such as validation loss or validation accuracy, during each epoch of training.

2. **Patience**: It has a parameter called "patience," which defines the number of epochs to wait for an improvement in the monitored metric before stopping the training. If the performance does not improve for a specified number of epochs, the training is stopped.

3. **Restore Best Weights**: In some implementations, early stopping can also restore the model weights to the state that resulted in the best performance on the validation set.

#### Benefits of early stopping include:

- **Preventing Overfitting**: By stopping training when the model starts to overfit the training data, early stopping helps maintain good generalization performance.
- **Saving Time and Resources**: It avoids unnecessary training epochs, saving computational resources and time.
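
The patience logic described above can be sketched in plain Python. This is a simplified illustration of the idea, not the code from Class_04_2 and not how any particular library implements it; the function name `run_with_early_stopping` and the validation-loss values are hypothetical.

```python
# Minimal sketch of early stopping with "patience" on a made-up loss curve.
# Epochs are numbered from 0 in this example.

def run_with_early_stopping(val_losses, patience):
    """Return (stopped_epoch, best_epoch, best_loss) for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch (a real trainer would also
            # save the weights here so they can be restored later).
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_loss
    return len(val_losses) - 1, best_epoch, best_loss

toy_val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.59]
stopped, best_epoch, best_loss = run_with_early_stopping(toy_val_losses, patience=3)
print(f"Stopped at epoch {stopped}; best epoch was {best_epoch} (val_loss={best_loss})")
```

In this toy run, training stops after three epochs without improvement, so the later, slightly lower loss is never reached. A larger patience value trades extra training time for a better chance of catching such late improvements.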

In the cell below, write the code to create an Early Stopping Monitor that monitors `val_loss`. Set the parameter `patience` to `10`.

Example 9 in Class_04_2 shows the code to create an Early Stopping Monitor. Do **not** copy all of the code in this example, just the code snippet in the section called:

~~~text
# Create EarlyStopping monitor--------------------------------------------
~~~



In [None]:
# Step 13: Create early stopping monitor



If your code is correct, you should not see any output.

## **Step 14: Train the Model**

In the cell below, write the Python code to train the neural network that you constructed in **Step 11**. Set the number of epochs to `100`. Make sure the parameter `verbose` is set to `2` so that the output of each epoch is written out.

Example 9 in Class_04_2 shows the code to train your model. Do **not** copy all of the code in this example, just the code snippet in the section called:

~~~text
# Train model-------------------------------------------
~~~


In [None]:
# Step 14: Train the Model



## **Step 15: Compute MSE**

**REF: Class_04_3 (Example 5)**

In the cell below, write the code to compute the Mean Squared Error (MSE) for your model.
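
For reference, MSE is the average of the squared differences between the true and predicted values. The sketch below computes it by hand on a few made-up numbers; the function name and the data are hypothetical, and Example 5 in Class_04_3 may compute the same quantity with a library call instead.

```python
# Minimal sketch: Mean Squared Error computed by hand on toy values.

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]   # made-up target values
y_pred = [2.0, 5.0, 4.0, 8.0]   # made-up model predictions

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse}")
```

Because the errors are squared, a single large miss (the error of 2 in the third value) contributes more to the MSE than several small ones.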

In [None]:
# Step 15: Compute MSE



## **Step 16: Compute RMSE**

**REF: Class_04_3 (Example 6)**

In the cell below, compute the Root Mean Squared Error (RMSE) for your neural network model. Print out the first 6 values in your `Y_compare` variable as well as your percent accuracy score.

NOTE: **Step 16** uses the variable holding the actual prediction values generated in **Step 15**, so you need to run **Step 15** before you run this step.
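
RMSE is the square root of the MSE, which puts the error back into the same units as the target variable. The sketch below shows the calculation on made-up values; the function name and the data are hypothetical and are not taken from Example 6 in Class_04_3.

```python
# Minimal sketch: Root Mean Squared Error computed by hand on toy values.
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [10.0, 20.0, 30.0, 40.0]   # made-up target values
y_pred = [12.0, 18.0, 32.0, 38.0]   # made-up predictions, each off by 2

rmse = root_mean_squared_error(y_true, y_pred)
print(f"RMSE: {rmse}")
```

Here every prediction is off by exactly 2 units, so the RMSE is also 2, which is why RMSE is often read as a "typical" prediction error.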

In [None]:
# Step 16: Compute RMSE



## **Assignment Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Copy of Assignment_01.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Poly-A Tail**

## **DeepSeek**

![__](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/DeepSeek_logo.svg/1920px-DeepSeek_logo.svg.png)

**DeepSeek** (Chinese: 深度求索; pinyin: Shēndù Qiúsuǒ) is a Chinese artificial intelligence company that develops open-source large language models (LLMs). Based in Hangzhou, Zhejiang, it is owned and funded by Chinese hedge fund High-Flyer, whose co-founder, Liang Wenfeng, established the company in 2023 and serves as its CEO.

The DeepSeek-R1 model provides responses comparable to other contemporary large language models, such as OpenAI's GPT-4o and o1. It is trained at a significantly lower cost—stated at US \$6 million compared to \$100 million for OpenAI's GPT-4 in 2023—and approximately a tenth of the computing power used for Meta's comparable model, LLaMA 3.1. DeepSeek's AI models were developed amid United States sanctions on China and other countries for chips used to develop artificial intelligence, which were intended to restrict the ability of these countries to develop advanced AI systems. Lesser restrictions were later announced that would affect all but a few countries.

On 10 January 2025, DeepSeek released its first free chatbot app, based on the DeepSeek-R1 model, for iOS and Android; by 27 January, DeepSeek had surpassed ChatGPT as the most-downloaded free app on the iOS App Store in the United States, causing Nvidia's share price to drop by 18%. DeepSeek's success against larger and more established rivals has been described as "upending AI" and ushering in "a new era of AI brinkmanship". DeepSeek's compliance with Chinese government censorship policies and its data collection practices have also raised concerns over privacy and information control in the model, prompting regulatory scrutiny in multiple countries.

DeepSeek makes its generative artificial intelligence algorithms, models, and training details open-source, so its code is freely available to use, modify, and view, along with design documents for building purposes. However, reports indicate that the API version hosted in China applies content restrictions in accordance with local regulations, limiting responses on topics such as the Tiananmen Square massacre and Taiwan’s status. The company reportedly vigorously recruits young AI researchers from top Chinese universities, and hires from outside the computer science field to diversify its models' knowledge and abilities.

**Background**

In February 2016, High-Flyer was co-founded by AI enthusiast Liang Wenfeng, who had been trading since the 2007–2008 financial crisis while attending Zhejiang University. They began stock-trading with a deep learning model running on GPU on October 21, 2016. Prior to this, they used CPU-based models, mainly linear models. Most trading was done by AI by the end of 2017.

By 2019, he established High-Flyer as a hedge fund focused on developing and using AI trading algorithms. By 2021, High-Flyer exclusively used AI in trading, often using Nvidia chips. DeepSeek has made its generative artificial intelligence chatbot open source, meaning its code is freely available for use, modification, and viewing. This includes permission to access and use the source code, as well as design documents, for building purposes.

In 2021, while running High-Flyer, Liang began stockpiling Nvidia GPUs for an AI project.[20] According to 36Kr, Liang had built up a store of 10,000 Nvidia A100 GPUs, which are used to train AI, before the United States federal government imposed AI chip restrictions on China.

On 14 April 2023,[22] High-Flyer announced the start of an artificial general intelligence lab dedicated to research developing AI tools separate from High-Flyer's financial business. Incorporated on 17 July 2023, with High-Flyer as the investor and backer, the lab became its own company, DeepSeek. Venture capital firms were reluctant to provide funding, as they considered it unlikely that the venture would be able to generate an "exit" in a short period of time.

On May 16, 2023, the company Beijing DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. incorporated under the control of Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. As of May 2024, Liang Wenfeng held 84% of DeepSeek through two shell corporations.

After releasing DeepSeek-V2 in May 2024, which offered strong performance for a low price, DeepSeek became known as the catalyst for China's AI model price war. It was quickly dubbed the "Pinduoduo of AI", and other major tech giants such as ByteDance, Tencent, Baidu, and Alibaba began to cut the price of their AI models to compete with the company. Despite the low price charged by DeepSeek, it was profitable compared to its rivals that were losing money.

DeepSeek is focused on research and has no detailed plans for commercialization, which also allows its technology to avoid the most stringent provisions of China's AI regulations, such as requiring consumer-facing technology to comply with the government's controls on information.

DeepSeek's hiring preferences target technical abilities rather than work experience, resulting in most new hires being either recent university graduates or developers whose AI careers are less established. Likewise, the company recruits individuals without any computer science background to help its technology understand other topics and knowledge areas, including being able to generate poetry and perform well on the notoriously difficult Chinese college admissions exams (Gaokao).

**Training framework**

High-Flyer/DeepSeek has built at least two computing clusters, Fire-Flyer (萤火一号) and Fire-Flyer 2 (萤火二号). Fire-Flyer began construction in 2019 and finished in 2020, at a cost of 200 million yuan. It contained 1,100 GPUs interconnected at a rate of 200 Gbps. It was 'retired' after 1.5 years in operation. Fire-Flyer 2 began construction in 2021 with a budget of 1 billion yuan.[18] It was reported that in 2022, Fire-Flyer 2's capacity had been utilized at over 96%, totaling 56.74 million GPU hours. Of those GPU hours, 27% was used to support scientific computing outside the company.

Fire-Flyer 2 consisted of a co-designed software and hardware architecture. On the hardware side, there are more GPUs with 200 Gbps interconnects. The cluster is divided into two "zones", and the platform supports cross-zone tasks. The network topology was two fat trees, chosen for their high bisection bandwidth. On the software side, there are:

* **3FS (Fire-Flyer File System):** A distributed parallel file system. It was specifically designed for asynchronous random reads from a dataset, and uses Direct I/O and RDMA Read. In contrast to standard Buffered I/O, Direct I/O does not cache data. Caching is useless for this case, since each data read is random, and would not be reused.
* **hfreduce:** Library for asynchronous communication, originally designed to replace Nvidia Collective Communication Library (NCCL).[30] It was mainly used for allreduce, especially of gradients during backpropagation. It is asynchronously run on the CPU to avoid blocking kernels on the GPU.[28] It uses two-tree broadcast like NCCL.
* **hfai.nn:** Software library of commonly used operators in neural network training, similar to torch.nn in PyTorch.
* **HaiScale Distributed Data Parallel (DDP):** Parallel training library that implements various forms of parallelism in deep learning such as Data Parallelism (DP), Pipeline Parallelism (PP), Tensor Parallelism (TP), Experts Parallelism (EP), Fully Sharded Data Parallel (FSDP) and Zero Redundancy Optimizer (ZeRO). It is similar to PyTorch DDP, which uses NCCL on the backend.
* **HAI Platform:** Various applications such as task scheduling, fault handling, and disaster recovery.

During 2022, Fire-Flyer 2 had 5000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. They chose to use the PCIe version of the A100 exclusively instead of the DGX version, since at the time the models they trained could fit within a single 40 GB GPU VRAM, so there was no need for the higher bandwidth of DGX (i.e., they required only data parallelism, not model parallelism).[30] Later, they also incorporated NVLinks and NCCL to train larger models that required model parallelism.