<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Multiclass_Classification_Spring25_Assigment_01_WineQual.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Assignment 1: Convolutional Neural Networks (CNN) for Computer Vision**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)



# **The Purpose of Assignments**

In this course, **_Assignments_** are designed to help me (and you) assess your ability to transfer knowledge gained in completing class coding exercises to solving more realistic problems.

Assignments play a pivotal role in reinforcing your learning, as they require you to apply theoretical concepts to practical scenarios. This helps solidify your understanding and enhances your problem-solving skills. By tackling these assignments independently, you develop critical thinking and the ability to synthesize information from various sources. Moreover, assignments encourage you to explore topics more deeply, fostering intellectual curiosity and promoting a deeper engagement with the subject matter. Ultimately, these assignments are not just a measure of your learning, but a means to equip you with the skills needed for real-world applications and future challenges.

## **MAKE A COPY OF THIS NOTBOOK!!**

For your assignment to be graded, you **must** make a copy of this Colab notebook in your GDrive and you this copy as your worksheet.

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# YOU MUST RUN THIS CELL FIRST

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this assignment.")
    COLAB = False

Mounted at /content/drive
Note: using Google CoLab
david.senseman@gmail.com


Your GMAIL address **must** appear in the output in order for your work to be graded.

### Define functions

The cell below creates several functions that are needed for this assignment. If you don't run this cell, you will receive errors later when you try to run some cells.

In [11]:
# Create functions for this lesson

def list_float_columns(dataframe):
    """
    Create a list of all columns in a DataFrame that contain float values.

    Parameters:
    dataframe (pd.DataFrame): The DataFrame to check.

    Returns:
    list: A list of column names that contain float values.
    """
    float_columns = [col for col in dataframe.columns if dataframe[col].dtype == 'float64']
    return float_columns

# **Assigment 1: Multiclass Classification**

**Assignment_01** is specifically designed to assess your ability to write the Python/Tensorflow/Keras code necessary to build neural networks that can perform binary classification, multiclass classification or regression on tabular data. Based on the 1st digit in your myUTSA ID ('abc123'), you have been assigned to perform **multiclass classification**.

Unlike your class lessons, you will **not** be given examples that you can use to simply copy-and-paste code. Rather, you will be given a problem to solve and it will be up to you to use code snippets that you have been given previously to solve different aspects of this assignment. And unlike your class lessons, your will **not** be given the correct output. In other words, this assignment is basically how you would solve an actual biomedical problem.


# **Multiclass Classification by Neural Networks**

**Multiclass classification** is a supervised learning task where the goal is to predict one of several categories for each instance in a dataset. Unlike binary classification, which involves two classes, multiclass classification deals with three or more possible outcomes. Neural networks are powerful tools for this task, and when applied to tabular data (data structured in tables with features as columns), they can yield accurate predictions.

**Neural Network Structure for Multiclass Classification:**

1. **Output Layer:** The output layer uses a softmax activation function to convert logits into probabilities, ensuring the sum of these probabilities equals 1. This allows the model to predict the most probable class.

2. **Loss Function:** Cross-entropy loss is commonly used to measure the difference between predicted and true probability distributions, making it suitable for multi-class problems.

3. **Optimization:** Gradient descent optimizes the model by adjusting weights to minimize the loss function, typically using an optimizer like Adam.

**Preprocessing Steps:**

- **Handling Missing Data:** Imputation or removal strategies are applied.
- **Feature Scaling:** Normalization techniques such as standardization or min-max scaling ensure features have comparable scales.
- **Categorical Encoding:** Techniques like one-hot encoding convert categorical variables into numerical form.


**Model Evaluation:**

Metrics include accuracy, precision, recall, F1-score, and confusion matrices to assess model performance comprehensively.

**Dataset Splitting:**

Data is divided into training and test sets to evaluate generalization effectively.

**Architecture Considerations:**

- **Dense Layers:** Typically sufficient for tabular data.
- **Dropout Layers:** Prevent overfitting by deactivating neurons during training.
- **Regularization:** Techniques like L2 regularization reduce overfitting.

**Overfitting Prevention:**

Techniques include cross-validation, early stopping (monitoring validation loss), and data augmentation if applicable.

**Frameworks:**

Using libraries like TensorFlow simplifies model construction. For example, a Keras sequential model with dense layers, softmax output, and Adam optimizer is standard practice.

# **Your Dataset for Assignmment_01**

The **_first_**  digit in your myUTSA ID (e.g. "abc123") will determine which dataset you are to analyze for this assignment and which type of neural network (i.e. classification or regression) you will need to construct. For example, if your myUTSA ID was **vue682**, then your first digit is the number `6`.

**---WARNING------WARNING------WARNING------WARNING------WARNING------WARNING---**

You are **not** free to choose any dataset for this assignment. If analyze the wrong dataset, your assignment will **NOT BE GRADED**. If you are uncertain which dataset you should be working on, contact your Instructor for help. Remember, your score in this assignment will have a large impact on your course grade so please be careful.


| Last Digit myUTSA ID     | Dataset to Analyze      | Neural Network Type
---------------------------|-------------------------|-----------------
0                          | Hepatitis               | Binary Classification
1                          | Coimbra Breast Cancer   | Binary Classification
2                          | Parkinson Speech        | Binary Classification
3                          | Indian Liver            | Binary Classification
4                          | Thyroid Replacement     | Multiclass Classification
5                          | Wine Quality            | Multiclass Classification
6                          | Bone Marrow Transplant  | Regression
7                          | Bioavailability         | Regression
8                          | METABRIC Breast Cancer  | Regression
9                          | Diabetes Progression    | Regression



## **Descriptions of Data Sets for Multiclass Classification**

This section describes the various datasets, information for downloading them, and what variable(s) your network should predict. Remember, you do not earn and credit if you analyze the wrong dataset. Pay particular attention to the variable for Multiclass Classification for your assigned dataset. You will need to know the name of this feature when you are constructing yor X- and Y-feature vectors.


---------------------------------

## **Wine Quality dataset - 1st myUTSA Digit = 5**

#### **Filename:** `wine_quality.csv`
#### **Response Variable for Multiclass Classification (Y):** `quality`


The **Wine Quality dataset** contains physicochemical properties of wines, which are used to predict the quality of the wine. The dataset includes two sets of wines: red and white vinho verde wines from the north of Portugal. Each wine sample is described by 11 features and a quality score. For this assignment you will focus only on the red wine.

#### Features:
1. **fixed acidity**: Non-volatile acids that remain in the wine during fermentation.
2. **volatile acidity**: Represents acetic acid content, which can give wine an undesirable vinegar flavor.
3. **citric acid**: Provides freshness to wines and can contribute to flavor.
4. **residual sugar**: The amount of sugar remaining after fermentation.
5. **chlorides**: The amount of salt in the wine.
6. **free sulfur dioxide**: SO₂ protects wine from oxidation and microbial growth.
7. **total sulfur dioxide**: Sum of free and bound SO₂.
8. **density**: Density of the wine, closely related to alcohol and sugar content.
9. **pH**: Measures the acidity or alkalinity of wine.
10. **sulphates**: A wine preservative and antioxidant.
11. **alcohol**: Alcohol content of the wine.

#### Output:
- **quality**: Wine quality score, likely on a scale from 0 to 10, based on sensory data from wine experts.

--------------------------------

# **General Instructions**

To make the assignment mpre manageable, you will given a number of specific steps to perform. For each step you will be given a specific example in a particular class lesson that you can use for a reference. For example, in **Step 1: Download and Extract Data** you are given **REF: Class_01_6 (Example 1)**. That means Example 1 in Class_01_6 provides similar code that you could use to complete that step of this assignment.

### **Variable Names**

In writing your code for this assignment, you are free to give your variables any name that makes sense to you. This includes the name of the DataFrame that holds your data. If you copy-and-paste code from earlier Class assignments, you always have to edit the name of the DataFrame to match the name you select for this assignment.

When it has been necessary to give an example name for a DataFrame in an instruction, the DataFrame has been called `dataFrameDF`. You will need to edit the name `dataFrameDF` to match the actual name you have given to your DataFrame.

### **Can I Use AI?**

You are free to use AI (e.g. Microsoft Co-Pilot) to help you complete your assignment---but you need to be very careful.

While AI can be very helpful in correcting coding errors, it can also give you code that is totally incorrect for this assignment. A small number of students in previous classes have flunked their assignment by using AI code that did not generate the correct output.

If your aren't sure what you are doing, it's much, much safer to get help with any of your coding problems from your course instructor and/or course TA's.

### **Step - 1: Download and Extract Data**

**REF: Class_01_6 (Example 1)**

As usual, a coding project starts with downloading a dataset. Since your dataset is in tabular form, you should use Pandas to read the datafile and store the information in a DataFrame. You are free to choose the name for all or your variables in this assignment, including the name of your DataFrame.

In the cell below, write the code to download your datafile from the course server and create a Pandas DataFrame to store your data. Use the function `display()` to show the data in 8 rows. For full credit, you need to show **all** of the columns in your DataFrame. You can do this by using the following code:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', 8)
~~~

Remember, you will need to edit the name `dataFrameDF` to make the name you select for your DataFrame.


In [1]:
# Step 1: Download and Extract Data

import numpy as np
import pandas as pd

# Read file and create DataFrame
wqDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/wine_quality.csv",
#    index_col=0,
    na_values=['NA','?'])

# Set max columns and max rows
pd.set_option('display.max_columns', wqDF.shape[1])
pd.set_option('display.max_rows', 8)

# Display DataFrame
display(wqDF)

Unnamed: 0,quality,color,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
0,5,red,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
1,5,red,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8
2,5,red,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8
3,6,red,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6493,5,white,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6
6494,6,white,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4
6495,7,white,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8
6496,6,white,6.0,0.21,0.38,0.8,0.020,22.0,98.0,0.98941,3.26,0.32,11.8


If your code is correct you should see a table with a relatively large number of columns that may extend beyond the right edge of your notebook.

## **Step 2: Describe DataFrame**

**REF: Class_01_6**

The `df.describe()` command in Python is used with pandas DataFrames. It provides a summary of statistics for each column in the DataFrame. By default, it will return the count, mean, standard deviation, min, 25th percentile (Q1), median (50th percentile), 75th percentile (Q3), and max values for numerical columns. It can be a handy tool for getting a quick overview of your dataset!

Use the `df.describe()` command to summaries the data in your DataFrame. Make sure to replace the `df` with the actual name of your DataFrame.

Again use these commands to set the display options:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', 8)
~~~


In [2]:
# Step 2: Describe DataFrame

import pandas as pd

# Set max columns and max rows
pd.set_option('display.max_columns', wqDF.shape[1])
pd.set_option('display.max_rows', 8)

# Describe() method
wqDF.describe()

Unnamed: 0,quality,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6491.0,6497.0,6497.0
mean,5.818378,7.215307,0.339666,0.318633,5.443235,0.056034,30.525319,115.744574,0.994697,3.218409,0.531268,10.491801
std,0.873255,1.296434,0.164636,0.145318,4.757804,0.035034,17.7494,56.521855,0.002999,0.160704,0.148806,1.192712
min,3.0,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0
25%,5.0,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5
50%,6.0,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3
75%,6.0,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3
max,9.0,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9


If your code is correct you should see a table with a relatively large number of columns that may extend beyond the right edge of your notebook and 8 rows countaining the summary statistics for each column.

## **Step 3: Find Missing Values**

**REF: Class_01_6 (Example 4)**

In **Biostatistics**, finding and replacing missing values is crucial for several reasons:

1. **Preserving Data Integrity**: Missing values can distort the analysis, leading to biased results. By addressing missing values, you ensure that your conclusions are based on complete and accurate data.

2. **Statistical Validity**: Many statistical tests and models require complete data. Missing values can reduce the statistical power of these tests, making it difficult to detect real effects.

3. **Avoiding Data Loss**: Simply discarding rows or columns with missing values can lead to a significant loss of data, especially if the dataset is already small. Imputing missing values helps retain as much data as possible.

4. **Model Accuracy**: Machine learning models can be sensitive to missing data. Handling missing values appropriately can improve the performance and accuracy of predictive models.

5. **Consistency**: Different columns in a dataset may have varying levels of missing data. Addressing these inconsistencies helps in creating a more uniform dataset, which is easier to analyze and interpret.

The `df.isnull()` command in pandas is used to detect missing values in a DataFrame. It returns a DataFrame of the same shape as the original, but with Boolean values: `True` where the value is missing (`NaN`) and `False` where the value is not missing.

In the cell below, use the command `df.isnull()` to find and print out any missing values in your DataFrame.

To make sure you see all of the values, use this code to set your display output:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', dataFrameDF.shape[0])

~~~

In [3]:
# Step 3: Find missing values

import pandas as pd

# Find the locations of missing data
missing_locations = wqDF.isnull().any()

# Set max columns and max rows
pd.set_option('display.max_columns', wqDF.shape[1])
pd.set_option('display.max_rows', wqDF.shape[0])

# Display the locations of missing data
print(missing_locations)

quality                 False
color                   False
fixed_acidity           False
volatile_acidity        False
citric_acid             False
residual_sugar          False
chlorides               False
free_sulfur_dioxide     False
total_sulfur_dioxide    False
density                 False
pH                       True
sulphates               False
alcohol                 False
dtype: bool


Take a good look at your output. In particular, pay attention to is the data type (`Dtype`).

Columns with data types that are either `int64` or `float64` are numeric while columns that are `object` are _categorical_ (string) values which must be converted into numerical values during data preprocessing.

If you don't see any column name followed by the`object` Dtypes, you don't have to worry about mapping strings to integers during preprocessing.

## **Step 4: Find Missing Values**

**REF: Class_01_6**

In **Biostatistics**, finding and replacing missing values is crucial for several reasons:

1. **Preserving Data Integrity**: Missing values can distort the analysis, leading to biased results. By addressing missing values, you ensure that your conclusions are based on complete and accurate data.

2. **Statistical Validity**: Many statistical tests and models require complete data. Missing values can reduce the statistical power of these tests, making it difficult to detect real effects.

3. **Avoiding Data Loss**: Simply discarding rows or columns with missing values can lead to a significant loss of data, especially if the dataset is already small. Imputing missing values helps retain as much data as possible.

4. **Model Accuracy**: Machine learning models can be sensitive to missing data. Handling missing values appropriately can improve the performance and accuracy of predictive models.

5. **Consistency**: Different columns in a dataset may have varying levels of missing data. Addressing these inconsistencies helps in creating a more uniform dataset, which is easier to analyze and interpret.

The `df.isnull()` command in pandas is used to detect missing values in a DataFrame. It returns a DataFrame of the same shape as the original, but with Boolean values: `True` where the value is missing (`NaN`) and `False` where the value is not missing.

In the cell below, use the command `df.isnull()` to find and print out any missing values in your DataFrame.

To make sure you see all of the values, use this code to set your display output:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', dataFrameDF.shape[0])

~~~

In [4]:
# Step 4: Find Missing Values

import pandas as pd

# Find the locations of missing data
missing_locations = wqDF.isnull().any()

# Set max columns and max rows
pd.set_option('display.max_columns', wqDF.shape[1])
pd.set_option('display.max_rows', wqDF.shape[0])

# Display the locations of missing data
print(missing_locations)

quality                 False
color                   False
fixed_acidity           False
volatile_acidity        False
citric_acid             False
residual_sugar          False
chlorides               False
free_sulfur_dioxide     False
total_sulfur_dioxide    False
density                 False
pH                       True
sulphates               False
alcohol                 False
dtype: bool


Inspect the your output and see if one (or more) columns have the word `True` after them. These columns contain one (or more) missing values. Make note of the column name since your will need to handle these missing values in the next step.  

## **STEP 4: Replace Missing Values**

**REF: Class_01_6 (Example 5)**

One common strategy for replacing missing values is to use the `median` of the column to replace the missing value. The `median` is used instead of the column `mean` since the `median` is a robust measure of central tendency and is less affected by outliers compared to the `mean`.

To do this:
1. Calculate the median of the column: Use the `median()` function to find the median value of the column.

2. Replace the missing values: Use the `fillna()` function to replace the missing values with the median.

If your DataFrame had one (or more) columns with missing values, you will need write the Python code to replace these missing values with the column's `median` value in the cell below.

After you have replaced the missing values, again use the `df.isnull()` command to print out the names of columns with missing values to make sure all of this missing values have been taken care of. (Just use the same code you wrote in **Step 3**.)




In [5]:
# Step 4: Replace Missing Values

import pandas as pd

# Find the median of the column Insulin
pH_med = wqDF['pH'].median()

# Print out the median value
print(f"The median value = {pH_med} for pH.")
print(f"Replacing missing values with {pH_med}.")

# Use fillna method
wqDF['pH'] = wqDF['pH'].fillna(pH_med)

# Find the locations of missing data
print("\nLooking for missing values...")  # The \n means print a newline
missing_locations = wqDF.isnull().any()

# Display the locations of missing data
print(missing_locations)

The median value = 3.21 for pH.
Replacing missing values with 3.21.

Looking for missing values...
quality                 False
color                   False
fixed_acidity           False
volatile_acidity        False
citric_acid             False
residual_sugar          False
chlorides               False
free_sulfur_dioxide     False
total_sulfur_dioxide    False
density                 False
pH                      False
sulphates               False
alcohol                 False
dtype: bool


If your code is correct, you should see the same output as in **Step 3**, but this time all of the column names should be followed by the word `False`.

## **Step 5: Display Non-numeric Categories**

**REF: Class_02_2 (Example 3 - Step 1)**

When building neural networks it is especially important to know which columns contain non-numeric data ("strings").

In the cell below, write the code to print out a list of columns in your DataFrame that contain non-numeric data.

In [6]:
# Step 5: Display non-numeric categoried

import pandas as pd

# Select columns
non_numerical_columns = wqDF.select_dtypes(exclude='number').columns

# Print result
print(*non_numerical_columns)

color


## **Step 6: Print Names in a Categorical Column**

**REF: Class_02_2 (Example 3 - Step 3)**

In the cell below, write the code to print out a list of categorical values (strings) that are in the non-numeric column found in **Step 5**.

In [7]:
# Step 6: Print names in a categorical column

import pandas as pd

# Generate a list with only unique values
numCat = list(wqDF['color'].unique())

# Print out the results
print(f'Number of categories: {len(numCat)}')
print(f'String names: {numCat}')

Number of categories: 2
String names: ['red', 'white']


The output above gives the string names that you need to map to integers in next step.


## **Step 7: Map Strings to Integer Values**

**REF: Class_02_2 (Example 3 - Step 2)**

In the cell below, write the code to map each string shown in the output above to a different integer value. To make sure your mapping worked as intended, use the `display(df)` function to display your updated DataFrame.

In [8]:
# Step 7: Map Strings to Integer Values

# Set max columns and max rows
pd.set_option('display.max_columns', wqDF.shape[1])
pd.set_option('display.max_rows', 8)


# Define the mapping dictionary
mapping = {'red': 1, 'white': 2}

# Check if all values to be mapped are present in the column
unique_values = wqDF['color'].unique()

# Find values in the column that are not in the mapping dictionary
missing_values = [value for value in unique_values if value not in mapping]

if missing_values:
    print(f"Error: The following values in the 'color' column are not in the mapping dictionary: {missing_values}")
    print(f"Error: Either your mapping is wrong or you have already converted the strings to integers")
else:
    # Map the 'Gender' column using the mapping dictionary
    wqDF['color'] = wqDF['color'].map(mapping)
    print("Wine quality data after mapping:")
    display(wqDF)

Wine quality data after mapping:


Unnamed: 0,quality,color,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
0,5,1,7.4,0.70,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
1,5,1,7.8,0.88,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8
2,5,1,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8
3,6,1,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6493,5,2,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6
6494,6,2,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4
6495,7,2,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8
6496,6,2,6.0,0.21,0.38,0.8,0.020,22.0,98.0,0.98941,3.26,0.32,11.8


## **Step 8: Shuffle and Reindex your DataFrame**

**REF: Class_02.3 (Example 2)**

**Shuffling and reindexing** the data are important steps when building neural networks for several reasons:

1. **Preventing Overfitting**: When the data is in a specific order, the model might learn patterns that are a result of the order rather than the actual data. Shuffling helps to prevent the model from overfitting to these spurious patterns.

2. **Improving Generalization**: By shuffling the data, you ensure that the model is exposed to a wide variety of examples during training, which helps it to generalize better to new, unseen data.

3. **Reducing Bias**: Shuffling ensures that the distribution of data is more uniform across the training batches. This reduces the risk of introducing bias, especially if the data has some inherent order that might otherwise affect the model's performance.

4. **Enhancing Convergence**: Neural networks often train faster and more reliably when the data is shuffled. This is because the model updates its parameters in a more representative and balanced manner.

5. **Ensuring Robustness**: By reindexing the data, you avoid potential issues that could arise from relying on the original indices, which might have some underlying structure or grouping that could affect the training process.


In the cell below, write the code to shuffle and reindex your DataFrame. Then display your shuffled DataFrame using the following display settings:

~~~text
# Set max columns and max rows
pd.set_option('display.max_columns', dataFrameDF.shape[1])
pd.set_option('display.max_rows', 8)
~~~

In [9]:
# Step 8: Shuffle and Reindex your DataFrame

import pandas as pd
import numpy as np

# Set the random seed to 42
np.random.seed(42)

# Use random.permutation function for shuffling & reindexing
wqDF = wqDF.reindex(np.random.permutation(wqDF.index))

# Set max columns and max rows
pd.set_option('display.max_columns', wqDF.shape[1])
pd.set_option('display.max_rows', 8)

# Display the shuffled & reindexed DataFrame
display(wqDF)

Unnamed: 0,quality,color,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol
3103,8,2,7.0,0.17,0.74,12.8,0.045,24.0,126.0,0.99420,3.26,0.38,12.2
1419,5,1,7.7,0.64,0.21,2.2,0.077,32.0,133.0,0.99560,3.27,0.45,9.9
4761,7,2,6.8,0.39,0.34,7.4,0.020,38.0,133.0,0.99212,3.18,0.44,12.0
4690,6,2,6.3,0.28,0.47,11.2,0.040,61.0,183.0,0.99592,3.12,0.51,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5191,7,2,5.6,0.28,0.27,3.9,0.043,52.0,158.0,0.99202,3.35,0.44,10.7
5226,5,2,6.4,0.37,0.20,5.6,0.117,61.0,183.0,0.99459,3.24,0.43,9.5
5390,5,2,6.5,0.26,0.50,8.0,0.051,46.0,197.0,0.99536,3.18,0.47,9.5
860,5,1,7.2,0.62,0.06,2.7,0.077,15.0,85.0,0.99746,3.51,0.54,9.5


If your code is correct, the output generated above by **Step 6** should look almost identical to the output you got after running **Step 1** except that this index numbers shown in the leftmore column should be in a random order.

## **Step 9: Normalize Data to the Z-values**

**REF: Class_02_2 (Example 2)**

When building neural networks, it is important to normalize the data using Z-score normalization for several reasons:

1. **Standardization**: Z-score normalization standardizes the features to have a mean of 0 and a standard deviation of 1. This ensures that all features contribute equally to the model and prevents features with larger scales from dominating the learning process.

2. **Improved Convergence**: Neural networks often use gradient-based optimization algorithms. Normalized data can lead to more stable and faster convergence of these algorithms because the gradients are more balanced and less likely to explode or vanish.

3. **Enhanced Performance**: Normalized data often leads to better model performance. By bringing all features to a similar scale, the model can learn more effectively and produce more accurate predictions.

4. **Handling Outliers**: Z-score normalization reduces the impact of outliers by transforming the data to a standard scale. Outliers will have high positive or negative Z-scores, but their influence will be mitigated compared to unnormalized data.

5. **Consistency**: Normalization ensures that different features are on the same scale, which is particularly important when combining multiple datasets or using features with different units of measurement.

To make things easier, only normalize the `float` data in your dataset by converting these numbers into their Z-values.

In the cell below, use the function `list_float_columns()` to find the names of all of the columns in your DataFrame that contain float number. (NOTE: This is custom function that was created near the beginning of the assignment). Once you know the column names, use can use the `zscore()` function to normalize the values only in the columns containing floats. Print out the names of the columns containing floats, and the first five Z-scores in each of these columns.

In [12]:
# Step 9: Normalize Data to the Z-values

import pandas as pd
from scipy.stats import zscore

float_columns = list_float_columns(wqDF)
print(f"Columns with float values: {float_columns}")

for col in float_columns:
    wqDF[col] = zscore(wqDF[col])

# Print the first 5 values of each float column
for col in float_columns:
    print(f"First 5 values in column '{col}': {wqDF[col].head().tolist()}")

Columns with float values: ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol']
First 5 values in column 'fixed_acidity': [-0.16608919286941418, 0.37389510870031095, -0.32037042188933573, -0.7060734944391392, 0.14247326517042896]
First 5 values in column 'volatile_acidity': [-1.0306285979470193, 1.8243655783403003, 0.3057516547832154, -0.36243847158190184, 0.06277342701408162]
First 5 values in column 'citric_acid': [2.8998445339598002, -0.7476133552587604, 0.14704612700239625, 1.0417056092635524, -0.8164333154326954]
First 5 values in column 'residual_sugar': [1.5463712438446127, -0.6817189480996253, 0.4113064290805669, 1.210055743173784, 1.7775881505558073]
First 5 values in column 'chlorides': [-0.3149750696080817, 0.5985040391814981, -1.028630623349941, -0.45770618035645344, -0.05805907026101235]
First 5 values in column 'free_sulfur_dioxide': [-0.36766435481755216

**HINT:** If you see this error:

~~~text
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-8-9b058d8fbfea> in <cell line: 0>()
      4 from scipy.stats import zscore
      5
----> 6 float_columns = list_float_columns(dataFrameDF)
      7 print(f"Columns with float values: {float_columns}")
      8

NameError: name 'list_float_columns' is not defined
~~~~

it means you didn't run the cell called `Define functions` at the beginning of this assignment.


## **Step 10: Pre-process Data for Neural Network Training**

**REF: Class_04_2 (Example 9)**

In the cell below, write the code to preprocess the data in your DataFrame to make ready to feed into your neural network.

**NOTES: Please follow these directions carefully:**

1. Since you have already converted your float values into their Z-scores, you should **not** normalize any data during your pre-processing. In other words, converting Z-scores into Z-scores, a second time, is not a good thing.

2. Basically all you need to do is write the code to generate your `X-feature vector` and your `Y-feature vector`. Your `Y-values` will be in the column that is the response variable. The particular response variable (`Y` values) for your particular dataset was specified in the dataset description at the start of this assignment.

3. When generating your `X-feature vector`, you should use _all_ of the columns in your DataFrame **EXCEPT** for the column containing the `Y-values`.

4. Since you will be building a **Multiclass Classification** neural network, you **_must_** one-hot encode the Y-values when generating your `Y-feature vector`.

5. Do **not** split your data into training and test set yet. You will do the split later.

6. When you are done, generating both your `X-` and `Y-` feature vectors, print out the first 4 values in each vector.

In [16]:
# Step 10 - Preprocess Data for Neural Network Training

import pandas as pd
from scipy.stats import zscore

# Generate column list for preprocessing
wqX_columns = wqDF.columns.drop('quality')

# Generate X feature vector
wqX = wqDF[wqX_columns].values
wqX = np.asarray(wqX).astype(np.float32)

# One-Hot encode the target column and generate Y
dummies = pd.get_dummies(wqDF['quality'], dtype=float) # Classification
wq_classes = dummies.columns
wqY = dummies.values
wqY = np.asarray(wqY).astype(np.float32)

# Print out X and Y
np.set_printoptions(suppress=True,precision=4)
print("The first 4 X-values are:")
print(wqX[0:4])
print("\nTheir corresponding Y-values are:")
print(wqY[0:4])

The first 4 X-values are:
[[ 2.     -0.1661 -1.0306  2.8998  1.5464 -0.315  -0.3677  0.1815 -0.1656
   0.259  -1.0166  1.4323]
 [ 1.      0.3739  1.8244 -0.7476 -0.6817  0.5985  0.0831  0.3053  0.3013
   0.3213 -0.5462 -0.4962]
 [ 2.     -0.3204  0.3058  0.147   0.4113 -1.0286  0.4212  0.3053 -0.8593
  -0.2391 -0.6134  1.2646]
 [ 2.     -0.7061 -0.3624  1.0417  1.2101 -0.4577  1.7171  1.19    0.408
  -0.6126 -0.1429 -0.8316]]

Their corresponding Y-values are:
[[0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]]


## **Step 11:  Construct and Compile Neural Network**

**REF: Class_04_2 (Example 10)**

In the cell below, use the Keras/Tensorflow libraries to split your data into test and train splits, making the test size = 0.25, and set the random state to 42.

Then construct and compile a multiclass classification neural network with 3 hidden layers but do not start your training in this step. You can use the code in Example 10 in Class_04_2 as a template for your model. Since this neural network will perform multiclass classification, the number of output neurons has to be equal to the number of classes in your Y feature vector.

After you construct your neural network, compile it but do not start training ("fitting") it yet.

In [17]:
# Step 11: Construct and compile neural network

from keras.models import Sequential
from keras.layers import Dense, Input
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.metrics import Precision, Recall
import numpy as np
from sklearn.model_selection import train_test_split

# Split into train/test--------------------------------------------------------
wqX_train, wqX_test, wqY_train, wqY_test = train_test_split(
    wqX, wqY, test_size=0.25, random_state=10)


# Construct model---------------------------------------------------------------
wqModel = Sequential()
wqModel.add(Input(shape=(wqX.shape[1],)))
wqModel.add(Dense(100, activation='relu',
                kernel_initializer='random_normal'))  # Hidden 1
wqModel.add(Dense(50,activation='relu',
                   kernel_initializer='random_normal')) # Hidden 2
wqModel.add(Dense(25,activation='relu',
                   kernel_initializer='random_normal')) # Hidden 3
wqModel.add(Dense(wqY.shape[1],activation='softmax',
                kernel_initializer='random_normal')) # Output

# Compile model------------------------------------------------------------------
wqModel.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics =['accuracy'])

If your code is correct, you should **not** see any output after running the previous cell.

## **Step 12: Print Summary of Your Model**

**REF: Class_04_2**

The `model.summary()` command in deep learning frameworks like Keras and TensorFlow provides a detailed summary of the neural network model. This summary includes useful information about the model's architecture, including:

1. **Layer Names and Types:** The name and type (e.g., Dense, Conv2D, LSTM) of each layer in the model.

2. **Output Shape:** The shape of the output produced by each layer.

3. **Number of Parameters:** The total number of trainable and non-trainable parameters in each layer. This includes both the weights and biases.

4. **Model Parameters Summary:** A total count of all trainable and non-trainable parameters in the model.

In the cell below, use the `model.summary()` command to print out the information about your neural network.



In [18]:
# Step 12: Print Summary of Your Model

wqModel.summary()

## **Step 13: Create Early Stopping Monitor**

**REF: Class_04_2**

An **Early Stopping Monitor** is a technique used during the training of neural networks to prevent overfitting and improve the model's generalization to new, unseen data. It works by monitoring the performance of the model on a validation dataset and stopping the training process when the performance starts to degrade.

#### Here’s how it works:

1. **Monitoring Performance**: Early stopping keeps track of a specific metric, such as validation loss or validation accuracy, during each epoch of training.

2. **Patience**: It has a parameter called "patience," which defines the number of epochs to wait for an improvement in the monitored metric before stopping the training. If the performance does not improve for a specified number of epochs, the training is stopped.

3. **Restore Best Weights**: In some implementations, early stopping can also restore the model weights to the state that resulted in the best performance on the validation set.

#### Benefits of early stopping include:

- **Preventing Overfitting**: By stopping training when the model starts to overfit the training data, early stopping helps maintain good generalization performance.
- **Saving Time and Resources**: It avoids unnecessary training epochs, saving computational resources and time.

In the cell below, write the code to create an Early Stopping Monitor that monitors `val_loss`. Set the parameter `patience` to `10`.



In [19]:
# Step 13: Create early stopping monitor

PATIENCE=10

wqMonitor = EarlyStopping(monitor='val_loss', min_delta=1e-3,
    patience=PATIENCE, verbose=1, mode='auto', restore_best_weights=True)

If your code is correct, you should not see any output.

## **Step 14: Train the Model**

In the cell below, write the Python code to train the neural network that you constructed in **Step 11**. Set the number of epochs to `100`. Make sure the parameter `verbose` is set to `2` so that the output of each epoch is written out.

In [20]:
# Step 14: Train the Model

# Set variables
EPOCHS=100
VERBOSE=2

# Train model
wqModel.fit(wqX_train,wqY_train,validation_data=(wqX_test,wqY_test),
          callbacks=[wqMonitor],verbose=VERBOSE,epochs=EPOCHS)



Epoch 1/100
153/153 - 3s - 20ms/step - accuracy: 0.4323 - loss: 1.4031 - val_accuracy: 0.5440 - val_loss: 1.1363
Epoch 2/100
153/153 - 2s - 11ms/step - accuracy: 0.5185 - loss: 1.1309 - val_accuracy: 0.5292 - val_loss: 1.0727
Epoch 3/100
153/153 - 1s - 4ms/step - accuracy: 0.5435 - loss: 1.0905 - val_accuracy: 0.5378 - val_loss: 1.0744
Epoch 4/100
153/153 - 1s - 4ms/step - accuracy: 0.5476 - loss: 1.0719 - val_accuracy: 0.5471 - val_loss: 1.0549
Epoch 5/100
153/153 - 1s - 4ms/step - accuracy: 0.5470 - loss: 1.0640 - val_accuracy: 0.5612 - val_loss: 1.0425
Epoch 6/100
153/153 - 1s - 4ms/step - accuracy: 0.5519 - loss: 1.0528 - val_accuracy: 0.5385 - val_loss: 1.0356
Epoch 7/100
153/153 - 0s - 3ms/step - accuracy: 0.5499 - loss: 1.0441 - val_accuracy: 0.5705 - val_loss: 1.0216
Epoch 8/100
153/153 - 1s - 4ms/step - accuracy: 0.5601 - loss: 1.0333 - val_accuracy: 0.5612 - val_loss: 1.0258
Epoch 9/100
153/153 - 1s - 4ms/step - accuracy: 0.5616 - loss: 1.0292 - val_accuracy: 0.5563 - val_los

<keras.src.callbacks.history.History at 0x7edbf0aa3750>

## **Step 15: Convert Prediction Probabilites into Actual Prediction Values**

**REF: Class_04_2, (Example 11)**

In the cell below, write the code to convert the prediction probabilities from your neural network model into actual prediction probabilities using the `df.armax()` method. Print out the prediction probabilites for the first 6 samples, and then print out the prediction values for the same 6 samples.

In [21]:
# Step 15: Convert Prediction Probabilites into Actual Prediction Values

import tensorflow as tf

# Define the predict function outside of any loops to prevent retracing
@tf.function
def predict_probabilities(model, inputs):
    return model(inputs)

# Use model to predict probabilities for subjects in x_test
wqProb = predict_probabilities(wqModel, wqX_test)

# Print prediction probabilities (first 6 samples)
print(f"Prediction Probabilities (first 6):\n{wqProb[0:6]}")

# Use argmax to convert probabilities to prediction values
wqPred = tf.argmax(wqProb, axis=1)

# Print out prediction values (first 6 samples)
print(f"Prediction Values (first 6):\n{wqPred.numpy()[0:6]}")

Prediction Probabilities (first 6):
[[0.0028 0.0122 0.5941 0.3842 0.0061 0.0006 0.    ]
 [0.0011 0.0121 0.4914 0.479  0.0156 0.0007 0.    ]
 [0.0004 0.0031 0.156  0.6933 0.1324 0.0146 0.0001]
 [0.0033 0.0172 0.0799 0.3912 0.4262 0.0767 0.0055]
 [0.0049 0.0375 0.3761 0.5099 0.0664 0.0051 0.0001]
 [0.006  0.0164 0.2563 0.4877 0.1848 0.0473 0.0015]]
Prediction Values (first 6):
[2 2 3 4 3 3]


## **Step 16: Compute the Percent Accuracy**

**REF: Class_04_2 (Example 12)**

In the cell below, compute the percent accuracy of your neural network model. Print out the first 6 values in your `Y_compare` variable as well as your percent accuracy score.

NOTE: **Step 14** uses the variable holding the actual prediction values generated in **Step 15**, so you will need to run **Step 15** before your run this step.

In [22]:
# Step 16: Compute percent accuracy

import numpy as np
from sklearn import metrics

# Generate array containing actual class type
wqY_compare = np.argmax(wqY_test, axis=1)
print(f'First 6 values in Y_compare: {wqY_compare[0:6]}')

# Make sure ecgPred is also a NumPy array on the CPU
if not isinstance(wqPred, np.ndarray):
    wqPred = wqPred.numpy()

# Compute the percentage score
score = metrics.accuracy_score(wqY_compare, wqPred)

# Print out the score
print("Accuracy score: {}".format(score))

First 6 values in Y_compare: [3 2 3 3 3 4]
Accuracy score: 0.5766153846153846


## **Assignment Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Copy of Assignment_01.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Poly-A Tail**

## **DeepSeek**

![__](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/DeepSeek_logo.svg/1920px-DeepSeek_logo.svg.png)

**DeepSeek** (Chinese: 深度求索; pinyin: Shēndù Qiúsuǒ) is a Chinese artificial intelligence company that develops open-source large language models (LLMs). Based in Hangzhou, Zhejiang, it is owned and funded by Chinese hedge fund High-Flyer, whose co-founder, Liang Wenfeng, established the company in 2023 and serves as its CEO.

The DeepSeek-R1 model provides responses comparable to other contemporary large language models, such as OpenAI's GPT-4o and o1. It is trained at a significantly lower cost—stated at US \$6 million compared to \$100 million for OpenAI's GPT-4 in 2023—and approximately a tenth of the computing power used for Meta's comparable model, LLaMA 3.1. DeepSeek's AI models were developed amid United States sanctions on China and other countries for chips used to develop artificial intelligence, which were intended to restrict the ability of these countries to develop advanced AI systems. Lesser restrictions were later announced that would affect all but a few countries.

On 10 January 2025, DeepSeek released its first free chatbot app, based on the DeepSeek-R1 model, for iOS and Android; by 27 January, DeepSeek had surpassed ChatGPT as the most-downloaded free app on the iOS App Store in the United States, causing Nvidia's share price to drop by 18%. DeepSeek's success against larger and more established rivals has been described as "upending AI" and ushering in "a new era of AI brinkmanship". DeepSeek's compliance with Chinese government censorship policies and its data collection practices have also raised concerns over privacy and information control in the model, prompting regulatory scrutiny in multiple countries.

DeepSeek makes its generative artificial intelligence algorithms, models, and training details open-source, allowing its code to be freely available for use, modification, viewing, and designing documents for building purposes.However, reports indicate that the API version hosted in China applies content restrictions in accordance with local regulations, limiting responses on topics such as the Tiananmen Square massacre and Taiwan’s status. The company reportedly vigorously recruits young AI researchers from top Chinese universities, and hires from outside the computer science field to diversify its models' knowledge and abilities.

**Background**

In February 2016, High-Flyer was co-founded by AI enthusiast Liang Wenfeng, who had been trading since the 2007–2008 financial crisis while attending Zhejiang University. They began stock-trading with a deep learning model running on GPU on October 21, 2016. Prior to this, they used CPU-based models, mainly linear models. Most trading was done by AI by the end of 2017.

By 2019, he established High-Flyer as a hedge fund focused on developing and using AI trading algorithms. By 2021, High-Flyer exclusively used AI in trading, often using Nvidia chips. DeepSeek has made its generative artificial intelligence chatbot open source, meaning its code is freely available for use, modification, and viewing. This includes permission to access and use the source code, as well as design documents, for building purposes.

In 2021, while running High-Flyer, Liang began stockpiling Nvidia GPUs for an AI project.[20] According to 36Kr, Liang had built up a store of 10,000 Nvidia A100 GPUs, which are used to train AI, before the United States federal government imposed AI chip restrictions on China.

On 14 April 2023,[22] High-Flyer announced the start of an artificial general intelligence lab dedicated to research developing AI tools separate from High-Flyer's financial business. Incorporated on 17 July 2023, with High-Flyer as the investor and backer, the lab became its own company, DeepSeek. Venture capital firms were reluctant to provide funding, as they considered it unlikely that the venture would be able to generate an "exit" in a short period of time.

On May 16, 2023, the company Beijing DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. incorporated under the control of Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd. As of May 2024, Liang Wenfeng held 84% of DeepSeek through two shell corporations.

After releasing DeepSeek-V2 in May 2024, which offered strong performance for a low price, DeepSeek became known as the catalyst for China's AI model price war. It was quickly dubbed the "Pinduoduo of AI", and other major tech giants such as ByteDance, Tencent, Baidu, and Alibaba began to cut the price of their AI models to compete with the company. Despite the low price charged by DeepSeek, it was profitable compared to its rivals that were losing money.

DeepSeek is focused on research and has no detailed plans for commercialization, which also allows its technology to avoid the most stringent provisions of China's AI regulations, such as requiring consumer-facing technology to comply with the government's controls on information.

DeepSeek's hiring preferences target technical abilities rather than work experience, resulting in most new hires being either recent university graduates or developers whose AI careers are less established. Likewise, the company recruits individuals without any computer science background to help its technology understand other topics and knowledge areas, including being able to generate poetry and perform well on the notoriously difficult Chinese college admissions exams (Gaokao).

**Training framework**

High-Flyer/DeepSeek has built at least two computing clusters, Fire-Flyer (萤火一号) and Fire-Flyer 2 (萤火二号). Fire-Flyer began construction in 2019 and finished in 2020, at a cost of 200 million yuan. It contained 1,100 GPUs interconnected at a rate of 200 Gbps. It was 'retired' after 1.5 years in operation. Fire-Flyer 2 began construction in 2021 with a budget of 1 billion yuan.[18] It was reported that in 2022, Fire-Flyer 2's capacity had been utilized at over 96%, totaling 56.74 million GPU hours. Of those GPU hours, 27% was used to support scientific computing outside the company.

Fire-Flyer 2 consisted of co-designed software and hardware architecture. On the hardware side, there are more GPUs with 200 Gbps interconnects. The cluster is divided into two "zones", and the platform supports cross-zone tasks. The network topology was two fat trees, chosen for its high bisection bandwidth. On the software side, there are

* **3FS (Fire-Flyer File System):** A distributed parallel file system. It was specifically designed for asynchronous random reads from a dataset, and uses Direct I/O and RDMA Read. In contrast to standard Buffered I/O, Direct I/O does not cache data. Caching is useless for this case, since each data read is random, and would not be reused.
* **hfreduce:** Library for asynchronous communication, originally designed to replace Nvidia Collective Communication Library (NCCL).[30] It was mainly used for allreduce, especially of gradients during backpropagation. It is asynchronously run on the CPU to avoid blocking kernels on the GPU.[28] It uses two-tree broadcast like NCCL.
* **hfai.nn:** Software library of commonly used operators in neural network training, similar to torch.nn in PyTorch.
* **HaiScale Distributed Data Parallel (DDP):** Parallel training library that implements various forms of parallelism in deep learning such as Data Parallelism (DP), Pipeline Parallelism (PP), Tensor Parallelism (TP), Experts Parallelism (EP), Fully Sharded Data Parallel (FSDP) and Zero Redundancy Optimizer (ZeRO). It is similar to PyTorch DDP, which uses NCCL on the backend.
* **HAI Platform:** Various applications such as task scheduling, fault handling, and disaster recovery.
During 2022, Fire-Flyer 2 had 5000 PCIe A100 GPUs in 625 nodes, each containing 8 GPUs. At the time, they chose to exclusively use PCIe instead of DGX version of A100, since at the time the models they trained could fit within a single 40 GB GPU VRAM, so there was no need for the higher bandwidth of DGX (i.e. they required only data parallelism but not model parallelism).[30] Later, they also incorporated NVLinks and NCCL, to train larger models that required model parallelism.