
# **👨🏻‍🎓 Learning Objective 👨🏻‍🎓**


### **Introduction**

👋 Hi students! In this lesson, we'll be discussing the importance of code optimization in data wrangling, techniques for optimizing code, best practices for efficient data wrangling, and strategies for parallel and distributed data wrangling.

🐍 Data wrangling involves transforming and manipulating data to make it more useful for analysis. As datasets grow in size, data wrangling code can become slow and inefficient, leading to longer processing times and slower analysis.

🤖 This is where code optimization comes in. Code optimization involves making your code more efficient and streamlined, to improve its performance and speed up processing times. This is especially important when working with large datasets.

📊 In this lesson, we'll explore some techniques for optimizing your code, including using vectorized operations, avoiding loops, reducing memory usage, and parallel processing.

🚀 We'll also discuss best practices for efficient data wrangling, such as understanding your data, keeping your code clean, and collaborating with others.

💻 Finally, we'll discuss strategies for parallel and distributed data wrangling, which involve using multiple processors or distributed computing resources to speed up processing times and improve the performance of your code.

📝 By understanding the importance of code optimization and using these techniques and best practices, you can improve the performance of your data wrangling code and make it easier to work with large datasets and perform complex analyses.

🐍 So, let's dive in and explore the world of code optimization in data wrangling!


### **Primary Goals**

👋 Hi there! This lesson covers four things: why code optimization matters in data wrangling, techniques for optimizing code, best practices for efficient data wrangling, and strategies for parallel and distributed data wrangling. 📊

🐍 Here are the primary goals of this lesson:

🤔 To understand the importance of code optimization in data wrangling and how it can improve the performance of your code and speed up processing times.

🚀 To learn techniques for optimizing your code, such as using vectorized operations, avoiding loops, reducing memory usage, and parallel processing.

📚 To explore best practices for efficient data wrangling, such as understanding your data, keeping your code clean, and collaborating with others.

🤖 To understand strategies for parallel and distributed data wrangling, which involve using multiple processors or distributed computing resources to speed up processing times and improve the performance of your code.

📝 By achieving these primary goals, you'll have a better understanding of how to optimize your data wrangling code and make it more efficient, allowing you to work with large datasets and perform complex analyses more easily.

🐍 So, let's get started and learn how to optimize our data wrangling code using these wonderful techniques and best practices!


# **📖 Learning Material 📖**

## **Introduction**

👋 You know that data wrangling, also known as data cleaning or data preparation, is the process of cleaning and transforming raw data into a more usable format. Code optimization, on the other hand, is the process of improving the performance of code by reducing its resource usage, improving its efficiency, and making it more maintainable.


🔍 Data wrangling and code optimization are closely related because they both involve working with data and code to make it more usable and efficient. In the context of data wrangling, code optimization is important because the process of cleaning and transforming data can be computationally intensive and time-consuming. By optimizing the code, we can reduce the time and resources required to perform these tasks and make the data wrangling process more efficient.


💻 There are several techniques that can be used to optimize code for data wrangling, such as vectorization, caching, and parallelization.

Vectorization involves performing operations on entire arrays or matrices of data rather than on individual elements, which can significantly improve the performance of code.

Caching involves storing the results of computationally intensive operations in memory so that they can be quickly accessed later, which can also improve performance.

Parallelization involves splitting up a task into smaller, independent parts that can be executed simultaneously on multiple processors or cores, which can further improve performance.

## **Importance of Code Optimization in Data Wrangling**

🔍 Data wrangling involves cleaning, transforming, and preparing raw data for analysis, which can be computationally intensive and time-consuming.

💻 Code optimization is the process of improving the performance of code by reducing its resource usage, improving its efficiency, and making it more maintainable.

🌟 By optimizing their code for data wrangling, developers can reduce the time and resources required to perform tasks such as cleaning, transforming, and merging data.

📈 Code optimization can also improve the scalability of data wrangling tasks, allowing data analysts and data scientists to work with larger datasets more efficiently.

🚀 Some techniques for code optimization in data wrangling include:

1. Vectorization, which involves performing operations on entire arrays or matrices of data rather than on individual elements

2. Caching, which involves storing the results of computationally intensive operations in memory for faster access

3. Parallelization, which involves splitting up a task into smaller, independent parts that can be executed simultaneously on multiple processors or cores

4. Concurrency, which involves making progress on multiple tasks over overlapping periods of time, either by interleaving them on a single processor/core (using techniques such as threads or asyncio) or by running them at the same time across multiple processors/cores (using techniques such as multiprocessing or distributed computing).


## **1. NumPy for vectorization:**



NumPy is a popular Python library for numerical computing that allows users to perform operations on entire arrays or matrices of data rather than on individual elements. This is known as vectorization and can significantly improve the performance of code, because the element-by-element loop runs inside NumPy's compiled code instead of in Python. For example, consider the following code for calculating the dot product of two arrays:

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

dot_product = 0
for i in range(len(a)):
    dot_product += a[i] * b[i]

print(dot_product)

32


This code uses a for loop to iterate over each element of the arrays and calculate the dot product. However, this can be slow and inefficient for large arrays. With NumPy, we can instead use the dot function to perform the same operation in a single step:

In [None]:
dot_product=np.dot(a,b)
dot_product

32
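
To see the difference on larger inputs, the two approaches can be timed side by side. The following is a minimal sketch, assuming two hypothetical arrays of 100,000 random values; the names `x`, `y`, and `dot_loop` are illustrative, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

# Hypothetical test data: two arrays of 100,000 random floats.
rng = np.random.default_rng(seed=0)
x = rng.random(100_000)
y = rng.random(100_000)


def dot_loop(u, v):
    """Dot product computed with an explicit Python loop."""
    total = 0.0
    for i in range(len(u)):
        total += u[i] * v[i]
    return total


# Both approaches should agree up to floating-point rounding.
loop_result = dot_loop(x, y)
numpy_result = np.dot(x, y)
results_match = bool(np.isclose(loop_result, numpy_result))

# Time three runs of each approach.
loop_seconds = timeit.timeit(lambda: dot_loop(x, y), number=3)
numpy_seconds = timeit.timeit(lambda: np.dot(x, y), number=3)

print("results match:", results_match)
print(f"loop:  {loop_seconds:.4f} s")
print(f"numpy: {numpy_seconds:.4f} s")
```

On most machines the `np.dot` call should finish far sooner than the loop, although the exact ratio depends on the hardware and the array size.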

## **Activity 1:**


Can you use NumPy to find the sum of squares of the first 10 positive integers without using any loops?

**🤔 Hint:**

*The sum of squares of the first n positive integers can be computed using the formula n(n+1)(2n+1)/6.*

In [None]:
%%time
#Your Code Here

#solution using loop
def sum_square_loop(n):
  total = 0  # avoid the name "sum", which would shadow Python's built-in sum()
  for i in range(1,n+1):
    total = total + (i**2)
  return total
sum_square_loop(100)

CPU times: user 89 µs, sys: 0 ns, total: 89 µs
Wall time: 95.1 µs


338350

In [None]:
%%time
#solution without a loop (closed-form formula from the hint)
def sum_square_without_loop(n):
  return (n * (n+1) * (2*n+1))/6
sum_square_without_loop(100)

CPU times: user 23 µs, sys: 2 µs, total: 25 µs
Wall time: 30 µs


338350.0
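
The activity asks for a NumPy solution, so here is one possible sketch alongside the closed-form formula. The function names `sum_square_numpy` and `sum_square_formula` are illustrative, and integer division (`//`) is used so that the formula returns an integer rather than a float.

```python
import numpy as np


def sum_square_numpy(n):
    """Sum of squares of 1..n using array operations instead of a loop."""
    values = np.arange(1, n + 1, dtype=np.int64)  # 1, 2, ..., n
    return int(np.sum(values ** 2))


def sum_square_formula(n):
    """Closed-form result n(n+1)(2n+1)/6; // keeps the answer an integer."""
    return n * (n + 1) * (2 * n + 1) // 6


print(sum_square_numpy(10))     # 385
print(sum_square_numpy(100))    # 338350
print(sum_square_formula(100))  # 338350
```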

## **2. Joblib for Caching:**



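Caching involves storing the results of computationally intensive operations so that they can be reused instead of recomputed. This can be especially useful in data wrangling tasks where the same operation may be performed multiple times on the same data. The joblib library provides a simple way to cache function calls in Python: its Memory class saves results to disk, in the directory given by the location parameter, and returns the saved result when the function is called again with the same arguments. For example, consider the following code for calculating the mean of a list of numbers: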

In [None]:
from joblib import Memory

# create a Memory object to cache function calls
mem = Memory(location='cache')

# define a function to calculate the mean of a list of numbers
@mem.cache
def calc_mean(numbers):
    print("Calculating mean...")
    total = sum(numbers)
    return total / len(numbers)

# define a list of numbers to use for testing
my_list = [1, 2, 3, 4, 5]

# call the function the first time, which will calculate the mean and cache the result
result1 = calc_mean(my_list)

# call the function again with the same argument, which will retrieve the cached result instead of recalculating it
result2 = calc_mean(my_list)

# print the results
print(result1)  # should be 3.0
print(result2)  # should also be 3.0, and the "Calculating mean..." message should not be printed


________________________________________________________________________________
[Memory] Calling __main__--content-<ipython-input-1ba9d3b9df3d>.calc_mean...
calc_mean([1, 2, 3, 4, 5])
Calculating mean...
________________________________________________________calc_mean - 0.0s, 0.0min
3.0
3.0


In this example, we start by importing the **Memory** class from the **joblib** library. We then create a Memory object with the **location** parameter set to **'cache'**, which specifies where to store the cached results.

Next, we define a function called **calc_mean** that takes a list of numbers as input and calculates their mean. We decorate the function with the **@mem.cache** decorator, which tells joblib to cache the results of this function call based on its input arguments.

We then define a list of numbers my_list to use for testing. We call the calc_mean function twice with the same argument, which should result in the function being executed only once and the second call retrieving the cached result.

Finally, we print the results of the two function calls, which should both be 3.0 (the mean of the input list), and the "Calculating mean..." message should only be printed once, during the first function call.
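
The same idea can be tried without joblib. The sketch below swaps in `functools.lru_cache` from the Python standard library, which keeps results in memory for the current session rather than on disk. The function name `calc_mean_cached` is illustrative, and tuples are used instead of lists because `lru_cache` requires hashable arguments.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def calc_mean_cached(numbers):
    """Mean of a tuple of numbers; results are cached per distinct argument."""
    return sum(numbers) / len(numbers)


first = calc_mean_cached((1, 2, 3, 4, 5))      # computed
second = calc_mean_cached((1, 2, 3, 4, 5))     # same argument: served from the cache
third = calc_mean_cached((1, 2, 3, 4, 5, 6))   # different argument: computed again

print(first, second, third)  # 3.0 3.0 3.5

# cache_info() reports how many calls were answered from the cache.
info = calc_mean_cached.cache_info()
print(info.hits, info.misses)  # 1 2
```

A cached result is only reused when the arguments match exactly, which is why the third call is a cache miss.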





## **Activity 2**

Suppose you have a function that takes a long time to run and returns a dictionary. Can you use Joblib to cache the results of this function so that subsequent calls to the function with the same arguments return the cached value without running the function again?

**🤔 Hint:**

*Use the Memory object from Joblib to create a memory cache for the function.*



In [None]:
#Your Code Here

In [None]:
#Solution
from joblib import Memory

# create a Memory object to cache function calls
mem = Memory(location='cache')

# define a function that assigns a grade to each student's score
@mem.cache
def student_results(scores):
  print("Processing the Results...")
  result = {}
  for name, score in scores.items():
    if score > 90:
      result[name] = 'Excellent'
    elif score > 80:
      result[name] = 'Very Good'
    else:
      result[name] = 'Pass'
  return result

students = {'Alice': 85, 'Bob': 95, 'Charlie': 75}

#Function Call 1
print(student_results(students))

#Function Call 2
print(student_results(students))


________________________________________________________________________________
[Memory] Calling __main__--content-<ipython-input-d169f96c03d6>.student_results...
student_results({'Alice': 85, 'Bob': 95, 'Charlie': 75})
Processing the Results...
__________________________________________________student_results - 0.0s, 0.0min
{'Alice': 'Very Good', 'Bob': 'Excellent', 'Charlie': 'Pass'}
{'Alice': 'Very Good', 'Bob': 'Excellent', 'Charlie': 'Pass'}


## **3. Pandarallel for Parallelization**

Let's install the Pandarallel module first.

In [None]:
!pip install pandarallel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandarallel
  Downloading pandarallel-1.6.5.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting dill>=0.3.1 (from pandarallel)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
     110.5/110.5 kB 10.1 MB/s eta 0:00:00
Building wheels for collected packages: pandarallel
  Building wheel for pandarallel (setup.py) ... done
  Created wheel for pandarallel: filename=pandarallel-1.6.5-py3-none-any.whl size=16677 sha256=9b2f1a4c8ef6d17735e91c33a5d64797b9ee036b682a28ec2e198d0203498c14
  Stored in directory: /root/.cache/pip/wheels/50/4f/1e/34e057bb868842209f1623f195b74fd7eda229308a7352d47f
Successfully built pandarallel
Installing collected packages: dill, pandarallel
Successfully installed dill-0.3.6 pandarallel-1.6.5


In [None]:
import pandas as pd
from pandarallel import pandarallel

# initialize pandarallel
pandarallel.initialize()

# create a sample dataframe
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# define a function to apply to each row of the dataframe
def my_func(row):
    return row['A'] + row['B']

# use pandarallel to apply the function to each row in parallel
result = df.parallel_apply(my_func, axis=1)

# print the result
print(result)


INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
0     7
1     9
2    11
3    13
4    15
dtype: int64
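
Before reaching for parallelism, it is worth checking whether a row-wise function can be expressed as a column operation. For a simple sum like `my_func` above, plain pandas arithmetic gives the same values without any workers. The sketch below is illustrative only; `df_demo`, `row_wise`, and `vectorized` are names introduced for this example.

```python
import pandas as pd

# Same sample data as in the Pandarallel example.
df_demo = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Row-by-row version: the function is called once per row.
row_wise = df_demo.apply(lambda row: row['A'] + row['B'], axis=1)

# Vectorized version: one operation on whole columns.
vectorized = df_demo['A'] + df_demo['B']

print(vectorized.tolist())                       # [7, 9, 11, 13, 15]
print(row_wise.tolist() == vectorized.tolist())  # True
```

Parallel apply tends to pay off when the per-row function is expensive and cannot be vectorized, and when more than one worker is available.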


## **Activity 3**

Can you use Pandarallel to parallelize a Pandas apply operation on a DataFrame in Google Colab?

**Instructions:**

1. Create a Pandas DataFrame containing a list of integers.

2. Define a custom function that squares a number.

3. Use Pandas apply method to apply the custom function to each element of the DataFrame.

4. Parallelize the apply operation using Pandarallel.

5. Compare the execution times of the apply operation with and without Pandarallel.

**Hints:**

1. Use the "pandarallel" library to enable parallel processing.

2. After calling "pandarallel.initialize()", use the "parallel_apply" method in place of "apply" to parallelize the operation.


In [None]:
import pandas as pd
from pandarallel import pandarallel
import numpy as np

In [None]:
# initialize pandarallel
pandarallel.initialize()

# create a sample dataframe
df = pd.DataFrame({'Numbers': [np.arange(1,10)]})

# define a function that squares its input
def my_func(num):
    return num**2

# use pandarallel's parallel_apply in place of apply
result = df.parallel_apply(my_func)

# print the result
print(result)

INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
                             Numbers
0  [1, 4, 9, 16, 25, 36, 49, 64, 81]


In [None]:
# create a sample dataframe
df = pd.DataFrame({'Numbers': [np.arange(1,10)]})

# define a function that squares its input
def my_func(num):
    return num**2

# use the regular pandas apply (no parallelism) for comparison
result = df.apply(my_func)

# print the result
print(result)

                             Numbers
0  [1, 4, 9, 16, 25, 36, 49, 64, 81]


## **4. Concurrency**

Concurrency is achieved by dividing the task into smaller parts that can be executed independently and concurrently. Concurrency can be achieved using multiple techniques such as threads, asyncio, and multiprocessing.

Here is an example of how to achieve concurrency using the threading library:

In [None]:
import threading
import time

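status = False  # shared flag: the worker stops once this becomes True

def worker():
  c = 0
  while not status:
    time.sleep(1)
    c = c + 1
    print(c)

# run worker() in a background thread so the main thread can wait for input
threading.Thread(target=worker).start()

sta = input("Enter Exit to terminate : ")
if sta == 'Exit':
  status = True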

1
2
3
4
5
6
7
8
9
Enter Exit to terminate : Exit
10


Here the worker thread keeps counting while the main thread waits for user input, so both tasks make progress over the same period. This demonstrates how concurrency can make our code more efficient and responsive.
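
A variation on the example above, offered as a sketch rather than the only way to do it, replaces the shared boolean with a `threading.Event` and waits for the worker with `join()`. It needs no keyboard input, so it also runs in non-interactive environments. The names `stop_event` and `ticks` and the 0.05-second interval are assumptions chosen for illustration.

```python
import threading
import time

stop_event = threading.Event()  # thread-safe flag shared with the worker
ticks = []                      # values recorded by the worker


def worker(interval=0.05):
    """Count upward until the main thread sets stop_event."""
    count = 0
    # wait() blocks for up to `interval` seconds and returns True once the event is set
    while not stop_event.wait(interval):
        count += 1
        ticks.append(count)


thread = threading.Thread(target=worker)
thread.start()

time.sleep(0.3)   # the main thread is free to do other work here
stop_event.set()  # ask the worker to stop
thread.join()     # wait until the worker has finished

print(f"worker counted {len(ticks)} ticks")
```

Using `join()` ensures the program does not continue until the worker has actually stopped, which makes the shutdown order predictable.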

## **🤔 Did You Know? 🤔**

Parallelization and concurrency are related concepts but they refer to different ways of achieving efficient use of computer resources.

Concurrency refers to managing multiple tasks whose execution overlaps in time. On a single processor or core this is done by switching between tasks, for example while one of them waits for user input or a network response. Techniques such as threads and asyncio provide concurrency. Overlapping tasks in this way can improve responsiveness and throughput, even if only one task is running at any given instant.

Parallelization, on the other hand, refers to executing multiple tasks at the same instant across multiple processors or cores. Parallelization can be achieved through techniques such as multiprocessing or distributed computing. In standard CPython, threads generally do not run Python code in parallel because of the global interpreter lock (GIL), so CPU-bound work usually needs multiple processes to achieve higher throughput and better performance.

In summary, concurrency is about making progress on multiple tasks over the same period, possibly on a single processor or core, while parallelization is about executing multiple tasks simultaneously on multiple processors or cores. Both techniques can be used to achieve efficient use of computer resources, but they have different advantages and limitations depending on the task at hand.

## **Best Practices for Efficient Data Wrangling**

Here are some best practices for Efficient Data Wrangling:

**🔍 1. Understand the Data:**

Before starting any data wrangling task, it is important to thoroughly understand the data and its structure. This includes knowing the data types, column names, missing values, and any potential outliers. This can help in identifying the appropriate data cleaning and transformation techniques needed.

**📝 2. Document the Process:**

Keeping track of the steps taken during data wrangling can be helpful for future reference and troubleshooting. This includes documenting the data sources, cleaning and transformation techniques used, and any assumptions made during the process.

**🚀 3. Use Vectorization:**

Vectorization is a technique that allows for performing operations on an entire array or data frame at once, instead of looping through each element. This can significantly improve the performance of data wrangling tasks, especially when dealing with large datasets. Libraries such as NumPy and Pandas provide vectorization capabilities.

**💻 4. Utilize Parallelization:**

Parallelization is another technique that can improve the efficiency of data wrangling tasks by breaking down the workload into smaller tasks that can be performed simultaneously. Libraries such as Dask and joblib provide parallelization capabilities for data wrangling tasks.

**🧹 5. Handle Missing Data:**

Missing data can cause issues during data wrangling and analysis. It is important to handle missing data appropriately, either by imputing values or removing observations. Pandas provides methods for handling missing data, such as fillna() and dropna().

**🤖 6. Automate Where Possible:**

Automation can help to streamline repetitive data wrangling tasks and reduce the potential for errors. Tools such as Python scripts and workflows in tools like Apache Airflow can help automate data wrangling processes.

**📊 7. Test and Validate:**

Testing and validating the data wrangling process is important to ensure accuracy and reliability of the final output. This can include checking for data consistency, confirming data types, and validating the final output against expectations.


Example of using vectorization in Pandas:

In [None]:
import pandas as pd

# Create a sample data frame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Multiply each element in column A by 2 using vectorization
df['A'] = df['A'] * 2

# Print the updated data frame
print(df)

   A  B
0  2  4
1  4  5
2  6  6
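
Practices 1, 5, and 7 from the list above (understand the data, handle missing data, test and validate) can be sketched in the same way. The data below is made up for illustration, and the column names `city` and `temp_c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, for illustration only.
raw = pd.DataFrame({
    'city': ['Pune', 'Delhi', None, 'Mumbai'],
    'temp_c': [31.0, np.nan, 28.5, 30.5],
})

# Understand the data: column types and missing values per column.
print(raw.dtypes)
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Handle missing data: drop rows without a city,
# then fill missing temperatures with the column mean.
cleaned = raw.dropna(subset=['city']).copy()
cleaned['temp_c'] = cleaned['temp_c'].fillna(cleaned['temp_c'].mean())

# Test and validate: no gaps remain and the row count is as expected.
assert cleaned.notna().all().all()
assert len(cleaned) == 3

print(cleaned)
```

Whether to impute or drop depends on the dataset; the mean is used here only because it is the simplest choice to demonstrate `fillna()`.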


# **✅ Summary ✅**

👋 Hi students! We've covered the importance of code optimization in data wrangling, techniques for optimization, best practices for efficient data wrangling, and strategies for parallel and distributed data wrangling. Let's review what we've learned. 📊

🤖 Code optimization is important in data wrangling as it can improve the performance of your code and speed up processing times.

🐍 Techniques for code optimization include using vectorized operations, avoiding loops, reducing memory usage, and parallel processing.

📚 Best practices for efficient data wrangling include understanding your data, keeping your code clean, and collaborating with others.

💻 Strategies for parallel and distributed data wrangling involve using multiple processors or distributed computing resources to speed up processing times and improve the performance of your code.

📈 By using these techniques and best practices, you can optimize your data wrangling code and make it more efficient, allowing you to work with large datasets and perform complex analyses more easily.

📝 It's important to keep in mind that code optimization is an ongoing process and requires continuous effort and improvement. By constantly seeking ways to improve your code and staying up-to-date with the latest techniques and tools, you can become an expert in data wrangling and analysis.

🐍 Keep practicing and applying these techniques and best practices, and you'll be well on your way to mastering data wrangling in no time!


# **➕ Additional Reading ➕**

### **Mnemonic**

👋 Hi there! Let me tell you the story of CoderWorlds, a data analytics company that applies the ideas from this lesson: the importance of code optimization in data wrangling, techniques for code optimization, best practices for efficient data wrangling, and strategies for parallel and distributed data wrangling.

📈 CoderWorlds specializes in data analytics and works with a wide range of clients from various industries. They understand the importance of code optimization in data wrangling and continuously seek ways to improve the performance of their code.

🐍 To achieve this, CoderWorlds uses techniques such as vectorized operations, avoiding loops, reducing memory usage, and parallel processing to optimize their code and speed up processing times.

📚 They also follow best practices for efficient data wrangling, such as understanding their data, keeping their code clean, and collaborating with other team members.

💻 In addition, CoderWorlds uses strategies for parallel and distributed data wrangling, such as using multiple processors or distributed computing resources, to further speed up processing times and improve the performance of their code.

🚀 Through their commitment to code optimization and efficient data wrangling practices, CoderWorlds has been able to provide their clients with high-quality data analysis services, enabling them to make better business decisions based on accurate and reliable data.

📝 By following in the footsteps of CoderWorlds and utilizing these concepts, you too can optimize your data wrangling code and improve the performance of your data analysis, making it easier to work with large datasets and perform complex analyses.

🐍 Keep practicing and applying these techniques and best practices, and you'll be well on your way to becoming an expert in data wrangling and analysis, just like CoderWorlds!


### **Best Practices/Tips**
🤔 Understand your data: Before you start working with your data, it's important to understand its structure, format, and any potential issues that may arise. This can help you optimize your code more effectively and avoid potential errors.

🐍 Use vectorized operations: Vectorized operations can perform multiple calculations simultaneously, making them much faster than looping over individual elements. Use them whenever possible to optimize your code and improve performance.

🧹 Keep your code clean: Writing clean, well-organized code can improve the readability of your code and make it easier to optimize. Use consistent naming conventions, comments, and indentation to keep your code organized and understandable.

📏 Reduce memory usage: Large datasets can quickly use up memory, leading to slow performance or even crashes. Use techniques such as data compression or chunking to reduce the amount of memory used by your code and improve its performance.

🤖 Use parallel processing: By using multiple cores or processors to run your code, you can speed up processing times and improve the performance of your code. Consider using libraries such as Dask or PySpark to enable parallel processing.

📚 Collaborate with others: Collaboration can help you identify potential issues with your code and find new ways to optimize it. Share your code with other team members and seek feedback to improve its performance.
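
Two of the memory-saving ideas mentioned above, smaller data types and chunking, can be sketched with pandas. This is a minimal illustration under assumed data: the column name `value`, the in-memory CSV text, and the chunk size of 4 are all made up for the example.

```python
import io

import numpy as np
import pandas as pd

# Smaller data types: values 0..999 fit in a 16-bit integer.
df_mem = pd.DataFrame({'value': np.arange(1_000, dtype=np.int64)})
bytes_before = int(df_mem.memory_usage(deep=True).sum())
df_mem['value'] = pd.to_numeric(df_mem['value'], downcast='integer')
bytes_after = int(df_mem.memory_usage(deep=True).sum())
print(bytes_before, bytes_after, df_mem['value'].dtype)

# Chunking: process a CSV a few rows at a time instead of loading it all at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))
running_total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    running_total += int(chunk['value'].sum())
    n_chunks += 1

print(running_total, n_chunks)  # 45 3
```

With a real file, the `io.StringIO` object would be replaced by a file path, and the chunk size would be chosen to fit comfortably in available memory.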


### **Shortcomings**


🕰️ Time-consuming: Code optimization can be time-consuming, especially when working with large datasets. It can require significant effort and trial-and-error to find the most efficient code.

🤔 Requires expertise: Code optimization requires a deep understanding of programming concepts and techniques. It may require additional training or education to become proficient in these skills.

📈 Diminishing returns: There may be a point where further optimization doesn't significantly improve performance. It's important to balance optimization efforts with the amount of time and resources required.

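🐌 Limited hardware: Some techniques for optimization, such as parallel processing, only help when multiple cores or machines are available. In an environment with a single core (as in the Pandarallel output earlier, which reports 1 worker), they add overhead without a speedup, so you may be limited in your optimization efforts.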

💻 Compatibility issues: Some optimization techniques may not be compatible with all systems or libraries. It's important to test your code on multiple systems to ensure compatibility.
