Commit

fixing typo
mariakakis committed Jun 13, 2023
1 parent 3d69d87 commit 19008ba
Showing 1 changed file with 10 additions and 12 deletions.
22 changes: 10 additions & 12 deletions projects/project3/Project_3.ipynb
@@ -27,9 +27,7 @@
"Wm2-b5b4ZkbK",
"X7kZTFKuddYV",
"NT6kGuNEZ9Eo",
"AcknqoGYaEkV",
"-yYdGwfNhw1F",
"Vv4sbZvCodVN",
"_JqQs1EObfOb",
"ZH3cviZ9qVVT",
"yKk471KciFR6",
@@ -116,7 +114,7 @@
{
"cell_type": "markdown",
"source": [
"You should also read the details of the challenge at https://physionet.org/challenge/2019/. \n",
"You should also read the details of the challenge at https://physionet.org/challenge/2019/.\n",
"Pay particular attention to what kind of data you will be working with and what the objective of the challenge is.\n"
],
"metadata": {
@@ -135,7 +133,7 @@
{
"cell_type": "markdown",
"source": [
"Machine learning is a vast and deep topic, and this assignment will only scratch the surface. \n",
"Machine learning is a vast and deep topic, and this assignment will only scratch the surface.\n",
"Although you should be able to complete this assignment strictly by following our instructions, it may help to read through some materials to familiarize yourself with important concepts.\n",
"There are hundreds of blogs, videos, and online courses that people use to learn about machine learning.\n",
"Here are a couple of our favorites:\n",
@@ -168,7 +166,7 @@
"We will need to take advantage of numerous software packages that will give us access to functions that will make our lives easier. You have already seen one of these packages (`numpy`) in a previous assignment, but the full list is below:\n",
"* [`numpy`](https://numpy.org/) provides efficient implementations of many math operations, particularly ones that involve arrays and matrices.\n",
"* [`pandas`](https://pandas.pydata.org) is an extremely useful library for working with data, especially when your data can be organized into tables (e.g., a `.csv` or `.xlsx` file).\n",
"* [`scikit-learn`](https://scikit-learn.org/stable/index.html) is a tool for data mining, data analysis and machine learning. \n",
"* [`scikit-learn`](https://scikit-learn.org/stable/index.html) is a tool for data mining, data analysis and machine learning.\n",
"* [`imbalanced-learn`](https://github.com/scikit-learn-contrib/imbalanced-learn) contains functions to help you work with imbalanced data.\n",
"* [`cache-em-all`](https://pypi.org/project/cache-em-all/) will allow us to save the result of a function so that it will only take a long time the first time you call it. In other words, `cache-em-all` allows you to save the result of the function so that it only takes 5 minutes the first time the function is called; whenever you call it again, it will only take a couple of seconds to load the saved result. This package was created by a TA from a previous iteration of the course.\n",
"\n",
@@ -371,7 +369,7 @@
{
"cell_type": "markdown",
"source": [
"Each row is assigned an numerical **index** value. By default, each row's index is the same as its row number (i.e., the first row has an index of 0, the second row has an index of 1, etc). However, this may not always be true since there may be situations when you need to either index your rows differently or shuffle your rows while keeping track of their original position. You can access specific rows of a `DataFrame` using either its position in the `DataFrame` (with the method `.loc[]`) or its index (with the method `.iloc[]`):"
"Each row is assigned an numerical **index** value. By default, each row's index is the same as its row number (i.e., the first row has an index of 0, the second row has an index of 1, etc). However, this may not always be true since there may be situations when you need to either index your rows differently or shuffle your rows while keeping track of their original position. You can access specific rows of a `DataFrame` using either its assigned label in the `DataFrame` (with the method `.loc[]`) or its positional index (with the method `.iloc[]`):"
],
"metadata": {
"id": "SgT1taeyo9JG"
@@ -594,7 +592,7 @@
{
"cell_type": "markdown",
"source": [
"Moving forward, we will consider all of the columns other than the patient ID to be our **features** (the inputs to our model) and our **label** (the desired output of our model) to be `SepsisLabel`. Because we are working with data that has labels, we will be using a type of learning called **supervised learning**. More specifically, we are trying to predict an output $y$ given an input $X$ and we have examples of $(X, y)$ pairs that we can use to train our system. This is in contrast to **unsupervised learning** where we would only have $X$ at our disposal. \n",
"Moving forward, we will consider all of the columns other than the patient ID to be our **features** (the inputs to our model) and our **label** (the desired output of our model) to be `SepsisLabel`. Because we are working with data that has labels, we will be using a type of learning called **supervised learning**. More specifically, we are trying to predict an output $y$ given an input $X$ and we have examples of $(X, y)$ pairs that we can use to train our system. This is in contrast to **unsupervised learning** where we would only have $X$ at our disposal.\n",
"\n",
"There are two types of tasks within supervised learning: **regression** and **classification**. Regression involves continuous labels like white blood cell count or the price of a house. Classification involves discrete labels like healthy/sick or cat/dog/bird; the different possible values that a discrete label can take for a given problem are known as **classes**. We will be doing binary (2-class) classification in this assignment."
],
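As a rough sketch of the feature/label split described in this cell, the example below separates a tiny table into $X$ and $y$. Only `SepsisLabel` comes from the notebook; the `patient_id`, `HR`, and `Temp` column names and all values are hypothetical.

```python
import pandas as pd

# Illustrative table; "patient_id", "HR", and "Temp" are hypothetical column names.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "HR": [80.0, 85.0, 120.0, 118.0],
    "Temp": [36.8, 37.0, 38.9, 39.1],
    "SepsisLabel": [0, 0, 1, 1],
})

# Features (X): every column except the patient ID and the label.
X = df.drop(columns=["patient_id", "SepsisLabel"])

# Label (y): the column the model should learn to predict.
y = df["SepsisLabel"]

# Two distinct label values means this is a binary classification problem.
print(sorted(y.unique().tolist()))
```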
@@ -685,13 +683,13 @@
"source": [
"def evaluate(actual, predicted, prefix=\"\"):\n",
" \"\"\"\n",
" Compares the predicted labels to the actual lables and prints out multiple metrics \n",
" Compares the predicted labels to the actual lables and prints out multiple metrics\n",
" of classification performance: precision, recall, and overall accuracy\n",
"\n",
" actual corresponds to the ground truth labels that were in the collected dataset\n",
" predicted corresponds to the labels that were predicted by the model\n",
" prefix is a string you can use to specify which data or model corresponds to the given analysis\n",
" \n",
"\n",
" \"\"\"\n",
" precision = precision_score(actual, predicted)\n",
" recall = recall_score(actual, predicted)\n",
@@ -740,7 +738,7 @@
{
"cell_type": "markdown",
"source": [
"What happens if you run `train_simple()` multiple times? The results are not consistent because there are multiple parts of our code that rely on randomness: how we split the data into training and test sets, how the model fits itself to training data, etc. Inconsistent results make it hard to replicate our work, so we should have a way to be able to produce the same result every time. \n",
"What happens if you run `train_simple()` multiple times? The results are not consistent because there are multiple parts of our code that rely on randomness: how we split the data into training and test sets, how the model fits itself to training data, etc. Inconsistent results make it hard to replicate our work, so we should have a way to be able to produce the same result every time.\n",
"\n",
"Many random number generators are not truly random; they are actually pseudorandom in that they generate numbers based on a **seed**. Therefore, if we set the value of the seed, we can control the sequence of random numbers the generator produces. We can do this by using the following lines of code:"
],
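The notebook's own seeding lines are not shown in this diff. As a stand-in, the sketch below uses Python's standard `random` module to show the effect of fixing a seed; the `draw` helper and the seed values are illustrative assumptions. Libraries such as `numpy` and `scikit-learn` have their own mechanisms (for example `np.random.seed` and the `random_state` parameter).

```python
import random


def draw(seed, n=5):
    """Return n pseudorandom integers after resetting the generator to `seed`."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(42)
second = draw(42)  # same seed, so the same sequence is produced again
other = draw(7)    # a different seed starts a different sequence

print(first == second)
print(first, other)
```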
@@ -855,7 +853,7 @@
"source": [
"Recall that there is a significant difference between the number of positive (`SepsisLabel = 1`) and negative examples (`SepsisLabel = 0`) in our dataset. When a dataset is significantly imbalanced, classifiers may become biased because there are too few examples of a particular class.\n",
"\n",
"Within the `DecisionTreeClassifier`, we can set `class_weight=\"balance\"` to tell the classifier to weigh classes according to how many examples there are in the training data. For example, if there are twice as many negative examples than positive examples, the classifier will consider incorrect predictions on positive examples twice as bad as a incorrect predictions on negative examples while training. \n",
"Within the `DecisionTreeClassifier`, we can set `class_weight=\"balance\"` to tell the classifier to weigh classes according to how many examples there are in the training data. For example, if there are twice as many negative examples than positive examples, the classifier will consider incorrect predictions on positive examples twice as bad as a incorrect predictions on negative examples while training.\n",
"\n",
"Another way to deal with class imbalance is by adjusting how the data is sampled. In this case, we are going to **undersample** the data, which means that were are going to keep all of the data in the minority class and decreasing the size of the majority class. The alternative would be **oversampling**, which means that we would be keeping the size of the majority class and repeating examples in the minority class. In your `train_simple()` implementation, use the `RandomUnderSampler` from `imbalanced-learn` to undersample the training data before you fit your model:"
],
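Both ideas in this cell can be sketched in plain Python. The weights below follow the formula scikit-learn documents for the `"balanced"` setting, `n_samples / (n_classes * count)`. The `undersample` helper is a hand-rolled illustration of random undersampling, not the `RandomUnderSampler` implementation, and the labels are made up.

```python
import random
from collections import Counter

# Made-up labels: twice as many negatives (0) as positives (1).
labels = [0] * 8 + [1] * 4

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" weights: rarer classes receive proportionally larger weights.
weights = {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def undersample(labels, seed=0):
    """Return sorted row indices that keep every class at the minority class size."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    minority_size = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Sample without replacement; the minority class is kept in full.
        kept.extend(rng.sample(indices, minority_size))
    return sorted(kept)


kept = undersample(labels)
resampled = [labels[i] for i in kept]

print(weights)
print(Counter(resampled))
```

Here the positive class ends up with twice the weight of the negative class, matching the "twice as bad" example in the text, and the undersampled data has four examples per class.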
@@ -1002,7 +1000,7 @@
"\n",
"* **Number of folds:** The skeleton we wrote for `train_stratified()` performs 5-fold cross-validation because of the parameter we passed to `GroupKFold`. What happens when you increase the number of folds to 10? What happens when you decrease the number of folds to 2 or 3?\n",
"\n",
"You could also improve your pipeline by looking into feature pre-processing, feature selection, and automated hyperparameter tuning. However, these topics are outside of the scope of this course. If you are interested in learning more, feel free to reach out to the instructors! \n",
"You could also improve your pipeline by looking into feature pre-processing, feature selection, and automated hyperparameter tuning. However, these topics are outside of the scope of this course. If you are interested in learning more, feel free to reach out to the instructors!\n",
"\n",
"The final deliverable for this assignment is a report that explains the different configurations you tried, the accuracy those configurations achieved, and written explanations of why you believe those results happened. Since there are so many configurations to choose from and each person may be splitting the data differently, we are not expecting everyone to achieve a definitive correct answer. What we are looking for is careful experimentation and viable explanations for the results of those experiments. You may find it helpful to use tables or graphs to systematically present your accuracy numbers. The report should be single-column, single-spaced, and no longer than 3 pages including tables and graphs. Save the report as either `report.pdf` or `report.docx`."
],
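To get a feel for the fold-count experiment, the sketch below runs `GroupKFold` with 2, 5, and 10 folds on made-up data. It assumes the groups are patient IDs, which this diff does not confirm; the arrays are illustrative, and this is not the notebook's `train_stratified()` skeleton.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up data: 20 rows, 2 features, 10 hypothetical patients with 2 rows each.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)
groups = np.repeat(np.arange(10), 2)

fold_sizes = {}
for n_splits in (2, 5, 10):
    splitter = GroupKFold(n_splits=n_splits)
    sizes = []
    for train_idx, test_idx in splitter.split(X, y, groups):
        # No group ever appears in both the training and the test portion of a fold.
        assert set(groups[train_idx]).isdisjoint(groups[test_idx])
        sizes.append(len(test_idx))
    fold_sizes[n_splits] = sizes

print(fold_sizes)
```

More folds means each model trains on a larger share of the data but is tested on fewer rows per fold, which is one thing to keep in mind when comparing results across fold counts.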