{
"cells": [
{
"cell_type": "raw",
"id": "3b2f4318-ba2d-40ae-9783-47a934627807",
"metadata": {},
"source": [
"---\n",
"title: Scaling Machine Learning Projects with Dask\n",
"slug: scaling-machine-learning-projects-with-dask\n",
"date: 2024-01-30\n",
"authors:\n",
" - Satarupa Deb\n",
"tags:\n",
" - open-source\n",
" - Machine Learning\n",
" - Dask\n",
" - python\n",
"categories:\n",
" - Python\n",
" - Machine Learning\n",
" - Parallel computing\n",
"description: |\n",
" This blog explores the usability of Dask in handling significant\n",
" challenges related to scaling models with large datasets, training and testing\n",
" of models, and implementing parallel computing functionalities. It also\n",
" provides a brief overview of the basic features of Dask.\n",
"thumbnail: /image.png\n",
"template: blog-post.html\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"id": "d900c1a1-1613-4910-9e10-96c39fabf8dc",
"metadata": {},
"source": [
"# Scaling Python Data Analysis with Dask\n",
"\n",
"As the volume of digital data continues to expand, coupled with the emergence of new machine learning models each day, companies are increasingly dependent on data analysis to inform business decisions. To effectively test and train these models with large datasets, scaling becomes a significant challenge, particularly in connecting Python analysts to distributed hardware. This challenge is particularly pronounced in the realm of data science and machine learning workloads. The complexities in this process often result in discrepancies that can lead to flawed training of data and consequently, inaccurate results.\n",
"\n",
"In this blog, we will suggest an effective solution to address the challenges discussed above. Imagine how much easier the scaling process would be with a Python library that could perform on both parallel and distributed computing. This is precisely what **Dask** does!\n",
"\n",
"## What is Dask?\n",
"\n",
"**Dask** is an open-source, parallel and distributed computing library in Python that facilitates efficient and scalable processing of large datasets. It is designed to seamlessly integrate with existing Python libraries and tools, providing a familiar interface for users already comfortable with Python and its libraries like NumPy, Pandas, Jupyter, Scikit-Learn, and others but want to scale those workloads across a cluster. Dask is particularly useful for working with larger-than-memory datasets, parallelizing computations, and handling distributed computing.\n",
"\n",
"## Setting Up Dask\n",
"\n",
"Installing Dask is straightforward and can be done using Conda or Pip. For Anaconda users, Dask comes pre-installed, highlighting its popularity in the data science community. Alternatively, you can install Dask via Pip, ensuring to include the complete extension to install all required dependencies automatically.\n",
"\n",
"```bash\n",
"#install using conda\n",
"conda install dask\n",
"```\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1269bbbf-a5e4-43b8-af7e-a2ad8a97c732",
"metadata": {},
"outputs": [],
"source": [
"#install using conda\n",
"!pip install \"dask[complete]\" -q"
]
},
{
"cell_type": "markdown",
"id": "f139ceb9-73e5-40f6-a978-038e89917c3b",
"metadata": {},
"source": [
"## Basic Concepts of Dask\n",
"\n",
"At its core, Dask extends the capabilities of traditional tools like pandas, NumPy, and Spark to handle larger-than-memory datasets. It achieves this by breaking large objects like arrays and dataframes into smaller, manageable chunks or partitions. This approach allows Dask to distribute computations efficiently across all available cores on your machine.\n",
"\n",
"## Dask DataFrames\n",
"\n",
"One of the standout features of Dask is its ability to handle large datasets effortlessly. With Dask DataFrames, you can seamlessly work with datasets exceeding 1 GB in size. By breaking the dataset into smaller chunks, Dask ensures efficient processing while maintaining the familiar interface of pandas DataFrames.\n",
"\n",
"## Features of Dask:\n",
"\n",
"1. **Parallel and Distributed Computing:**\n",
" Dask enables parallel and distributed computing, making it a go-to solution for handling datasets that exceed the available memory of a single machine. It breaks down computations into smaller tasks, allowing for concurrent execution and optimal resource utilization."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "11414276-66ff-4bba-a8e0-e8315d49038f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 9986.14978723 10073.19700192 9985.6576724 ... 9923.550924\n",
" 9978.70237439 9990.8504103 ]\n"
]
}
],
"source": [
"#demonstrating parallel and distributed computing using Dask\n",
"import dask.array as da\n",
"\n",
"# Create a large random array\n",
"x = da.random.random((10000, 10000), chunks=(1000, 1000)) # 10,000 x 10,000 array\n",
"\n",
"# Perform element-wise computation\n",
"y = x * 2\n",
"\n",
"# Compute the sum along one axis\n",
"z = y.sum(axis=0)\n",
"\n",
"# Compute the result in parallel across multiple cores or distributed across a cluster\n",
"result = z.compute()\n",
"\n",
"print(result)\n"
]
},
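{
"cell_type": "markdown",
"id": "4a8c2e6f-0d5b-4e7a-9f3c-2b6d0a4e8f31",
"metadata": {},
"source": [
"By default the computation above runs on a local thread pool. To run the same code on distributed hardware, you can start a `dask.distributed` `Client`. A minimal sketch that launches local worker processes (a real scheduler address could be passed instead):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5b9d3f7a-1e6c-4f8b-a04d-3c7e1b5f9a42",
"metadata": {},
"outputs": [],
"source": [
"# Starting a local 'cluster' with the distributed scheduler\n",
"from dask.distributed import Client\n",
"\n",
"client = Client()  # local worker processes; pass a scheduler address for a real cluster\n",
"print(client)\n",
"\n",
"# ...Dask computations run here are scheduled through the client...\n",
"\n",
"client.close()"
]
},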
{
"cell_type": "markdown",
"id": "5ef1efce-6001-4895-8ae1-395f8a199c3f",
"metadata": {},
"source": [
"2. **Dask Collections:**\n",
" Dask provides high-level abstractions known as Dask collections, which are parallel and distributed counterparts to familiar Python data structures. These include `dask.array` for parallel arrays, `dask.bag` for parallel bags, and `dask.dataframe` for parallel dataframes, seamlessly integrating with existing Python libraries."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0e29dd78-875c-4aea-91fb-19f2aaf1cb4a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.0\n"
]
}
],
"source": [
"#Using dask collections\n",
"import dask.array as da\n",
"\n",
"# Creating a dummy dataset using Dask\n",
"x = da.ones((100, 100), chunks=(10, 10)) # Creating a 100x100 array of ones with chunks of 10x10\n",
"y = x + x.T # Adding the transpose of x to itself\n",
"result = y.mean() # Calculating the mean of y\n",
"\n",
"# Computing the result\n",
"print(result.compute()) # Outputting the computed result\n"
]
},
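{
"cell_type": "markdown",
"id": "c4f8e2a6-7b1d-4f3e-a5c9-8e2d6b0f4a17",
"metadata": {},
"source": [
"For unstructured or semi-structured data, `dask.bag` provides the same kind of parallelism over plain Python objects. A minimal sketch (the sequence and partition count are arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d5a9f3b7-8c2e-4a4f-b6da-9f3e7c1a5b28",
"metadata": {},
"outputs": [],
"source": [
"# Using dask.bag for parallel operations on generic Python objects\n",
"import dask.bag as db\n",
"\n",
"# Create a bag from a sequence, split into 4 partitions\n",
"b = db.from_sequence(range(10), npartitions=4)\n",
"\n",
"# map/filter build a task graph; compute() runs it in parallel\n",
"result = b.map(lambda x: x * 2).filter(lambda x: x > 5).compute()\n",
"print(result)"
]
},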
{
"cell_type": "markdown",
"id": "89052ee6-ca13-47d2-bb7d-c415e42a9a17",
"metadata": {},
"source": [
"3. **Lazy Evaluation:**\n",
" One of Dask's core principles is lazy evaluation. Instead of immediately computing results, Dask builds a task graph representing the computation. The actual computation occurs only when the results are explicitly requested. This approach enhances efficiency and allows for optimizations in resource usage.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "137b40a2-9c56-4834-bc14-c8cd5c89a244",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5112260135512784\n"
]
}
],
"source": [
"#Lazy Evalution with dask\n",
"import dask.dataframe as dd\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Creating a dummy dataset\n",
"num_rows = 100 # Number of rows\n",
"data = {\n",
" 'column': np.random.randint(0, 100, size=num_rows),\n",
" 'value': np.random.rand(num_rows)\n",
"}\n",
"\n",
"# Creating a Pandas DataFrame\n",
"df_pandas = pd.DataFrame(data)\n",
"\n",
"# Saving the Pandas DataFrame to a CSV file\n",
"df_pandas.to_csv('your_dataset.csv', index=False)\n",
"\n",
"# Reading the CSV file into a Dask DataFrame\n",
"df = dd.read_csv('your_dataset.csv')\n",
"\n",
"# Filtering the Dask DataFrame\n",
"filtered_df = df[df['column'] > 10]\n",
"\n",
"# Calculating the mean of the filtered DataFrame\n",
"mean_result = filtered_df['value'].mean()\n",
"\n",
"# No computation happens until explicitly requested\n",
"print(mean_result.compute()) # Outputting the computed result\n"
]
},
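{
"cell_type": "markdown",
"id": "6c0e4a8b-2f7d-4a9c-b15e-4d8f2c6a0b53",
"metadata": {},
"source": [
"To see the laziness directly: operations return Dask objects rather than results until `compute()` is called. A small sketch reusing the `your_dataset.csv` file written above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d1f5b9c-3a8e-4b0d-826f-5e9a3d7b1c64",
"metadata": {},
"outputs": [],
"source": [
"# Operations build a task graph; nothing runs until compute()\n",
"import dask.dataframe as dd\n",
"\n",
"df = dd.read_csv('your_dataset.csv')\n",
"lazy_mean = df['value'].mean()\n",
"\n",
"print(lazy_mean)            # a lazy Dask scalar, not a number\n",
"print(lazy_mean.compute())  # triggers the actual computation"
]
},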
{
"cell_type": "markdown",
"id": "2d2800b0-5ac3-429b-8e15-bec442f45df3",
"metadata": {},
"source": [
"4. **Integration with Existing Libraries:**\n",
" Dask is designed to integrate seamlessly with popular Python libraries, such as NumPy, Pandas, and scikit-learn. This means that you can often replace existing code with Dask equivalents without significant modifications."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c3e4cc2e-f84f-4033-af96-f5d68bf0d5c1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1.40940907 1.32044698 1.48172367 ... 1.4266846 0.84142743 0.33577001]\n",
" [1.32044698 1.02252065 1.17250384 ... 0.40216939 1.58544767 1.12049071]\n",
" [1.48172367 1.17250384 1.98886224 ... 0.86271956 1.27977778 0.95136532]\n",
" ...\n",
" [1.4266846 0.40216939 0.86271956 ... 1.44980096 1.38712404 0.75331149]\n",
" [0.84142743 1.58544767 1.27977778 ... 1.38712404 1.50814693 1.01719649]\n",
" [0.33577001 1.12049071 0.95136532 ... 0.75331149 1.01719649 1.47050452]]\n"
]
}
],
"source": [
"# Integration with NumPy\n",
"import dask.array as da\n",
"import numpy as np\n",
"\n",
"# Generating a random NumPy array\n",
"x_np = np.random.random((100, 100))\n",
"\n",
"# Converting the NumPy array to a Dask array\n",
"x_dask = da.from_array(x_np, chunks=(10, 10))\n",
"\n",
"# Performing operations on the Dask array\n",
"y_dask = x_dask + x_dask.T\n",
"\n",
"# Computing the result\n",
"print(y_dask.compute())\n"
]
},
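{
"cell_type": "markdown",
"id": "f2b6d0a4-9e3c-4b5f-8d1a-0c4e8f2b6d39",
"metadata": {},
"source": [
"The same pattern works with pandas: an existing pandas DataFrame can be wrapped in a Dask DataFrame with `dd.from_pandas`. A minimal sketch (the column names and partition count are arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a8c2e6f0-1d5b-4e7a-9f3c-2b6d0a4e8f40",
"metadata": {},
"outputs": [],
"source": [
"# Integration with pandas\n",
"import dask.dataframe as dd\n",
"import pandas as pd\n",
"\n",
"# An ordinary pandas DataFrame\n",
"pdf = pd.DataFrame({'a': range(100), 'b': range(100, 200)})\n",
"\n",
"# Wrap it in a Dask DataFrame split into 4 partitions\n",
"ddf = dd.from_pandas(pdf, npartitions=4)\n",
"\n",
"# The familiar pandas API, evaluated lazily\n",
"print(ddf['a'].sum().compute())"
]
},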
{
"cell_type": "markdown",
"id": "a6bfffe1-fa8f-46fd-b734-248fd1631df6",
"metadata": {},
"source": [
"5. **Task Scheduling:**\n",
" Dask dynamically schedules the execution of tasks, optimizing the computation based on available resources. This makes it well-suited for handling larger-than-memory datasets efficiently. Dask is a powerful tool for data scientists and engineers working with large-scale data processing tasks, providing a convenient way to scale computations without requiring a complete rewrite of existing code.\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2356c4a1-2c78-426d-9019-127cd754ec62",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 4, 9, 16, 25, 36, 49, 64, 81)\n"
]
}
],
"source": [
"# Dynamic task scheduling with Dask\n",
"import dask\n",
"\n",
"@dask.delayed\n",
"def square(x):\n",
" return x * x\n",
"\n",
"data = [1, 2, 3, 4, 5, 6, 7, 8, 9]\n",
"results = []\n",
"\n",
"for value in data:\n",
" result = square(value)\n",
" results.append(result)\n",
"\n",
"final_result = dask.compute(*results)\n",
"print(final_result)\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "6dbaf782-33c5-4956-a332-2620a49bd554",
"metadata": {},
"source": [
"## Conclusion:\n",
"\n",
"**Dask** stands as a powerful tool in the Python ecosystem, addressing the challenges posed by the ever-increasing scale of data. Its ability to seamlessly integrate with existing libraries, support lazy evaluation, and provide parallel and distributed computing makes it a valuable asset for data scientists and engineers tackling large-scale data processing tasks. Whether you're working on a single machine with moderately sized datasets or dealing with big data challenges that require distributed computing, Dask offers a flexible and efficient solution. As we continue to navigate the era of big data, Dask proves to be a key player in unlocking the full potential of Python for scalable and parallelized data processing. Start harnessing the power of Dask today and supercharge your data processing workflows!\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}