# DS Workshop Day 1: Tips and Tricks 


## Welcome to this data science workshop by [GeeksHub](https://www.facebook.com/GeeksHUB.eg) !!! 
(Check out our page for more details.) 👀


## 0. Quick Notes on Efficient use of Pandas 


*  Selecting columns by name (`df["col"]`) is faster than using `.iloc`, and `.iloc` is in turn faster than `.loc`.

*  Random sampling with the pandas `.sample()` method is faster than generating random indices with NumPy and selecting those rows.

*  pandas `.replace()` can run about 1000 times faster than assigning with `.loc[CONDITION] = new_value`, and the more complex the condition, the bigger the gap in favor of `.replace()`.

*  You can use a single `.replace()` call for two replacement operations:
    
    `data["column"] = data["column"].replace(["old1", "old2"], ["new1", "new2"])`
    
    When several values are replaced in different ways, or you want to perform the same replacement across multiple columns, you can pass a dictionary instead, which is even more efficient:
    
    `data.replace({"old1": "new1", "old2": "new2"}, inplace=True)`
    
*  `.iterrows()` is not necessarily faster than looping over `range(len(data))`, but it produces cleaner code.
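The `.replace()` tips above can be checked with a small sketch. The `color` column, its values, and the helper function names are made up for illustration. The timings printed at the end depend on your machine and pandas version, so treat them as something to measure rather than a guarantee.

```python
import timeit

import pandas as pd

# Hypothetical toy data: one text column repeated to 100,000 rows.
data = pd.DataFrame({"color": ["red", "blue", "green", "red"] * 25_000})


def replace_with_loc(df):
    """Conditional assignment with .loc, one statement per value."""
    out = df.copy()
    out.loc[out["color"] == "red", "color"] = "R"
    out.loc[out["color"] == "blue", "color"] = "B"
    return out


def replace_with_lists(df):
    """One .replace() call with two parallel lists."""
    return df.replace(["red", "blue"], ["R", "B"])


def replace_with_dict(df):
    """One .replace() call with an old -> new dictionary."""
    return df.replace({"red": "R", "blue": "B"})


# All three approaches should produce the same frame.
expected = replace_with_loc(data)
print(replace_with_lists(data).equals(expected))
print(replace_with_dict(data).equals(expected))

# Time each approach; no particular ranking is assumed here.
for func in (replace_with_loc, replace_with_lists, replace_with_dict):
    seconds = timeit.timeit(lambda: func(data), number=5)
    print(f"{func.__name__}: {seconds:.4f} s for 5 runs")
```

The same pattern (wrap each variant in a function, confirm the results match, then time them) also works for comparing column selection styles or sampling methods on your own data.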


## 1. The differences between loc and iloc in Pandas 

## Using loc:

* `loc` is label-based indexing: you access data using row and column labels, and label slices include both endpoints.

## Using iloc:

* `iloc` is integer-based indexing: you access data using integer positions.
* You specify rows and columns by their integer positions, starting from 0; slices exclude the end position, as in regular Python slicing.
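As a minimal sketch of the two indexers side by side, the frame below uses made-up labels (`a` to `d`) and column names. It shows that a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position.

```python
import pandas as pd

# Hypothetical toy frame with string labels as the index.
df = pd.DataFrame(
    {"height": [10, 20, 30, 40], "width": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

# Label-based: rows "a" through "c" (end label included), column "height".
by_label = df.loc["a":"c", "height"]

# Position-based: rows 0 and 1 (end position excluded), first column.
by_position = df.iloc[0:2, 0]

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]
```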

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Setting the 'species' column as the index
iris.set_index('species', inplace=True)


In [3]:
# Example 1: Selecting specific rows and columns using loc
subset_loc = iris.loc[['setosa', 'versicolor'], ['sepal_length', 'petal_length']]
print("Using loc:")
print(subset_loc)

# Example 2: Selecting specific rows and columns using iloc
subset_iloc = iris.iloc[0:2, [0, 2]]
print("\nUsing iloc:")
print(subset_iloc)


Using loc:
            sepal_length  petal_length
species                               
setosa               5.1           1.4
setosa               4.9           1.4
setosa               4.7           1.3
setosa               4.6           1.5
setosa               5.0           1.4
...                  ...           ...
versicolor           5.7           4.2
versicolor           5.7           4.2
versicolor           6.2           4.3
versicolor           5.1           3.0
versicolor           5.7           4.1

[100 rows x 2 columns]

Using iloc:
         sepal_length  petal_length
species                            
setosa            5.1           1.4
setosa            4.9           1.4


In [4]:
# Example 3: Selecting rows based on conditions using loc
conditioned_data_loc = iris.loc[iris['sepal_width'] > 3.5]
print("\nRows based on condition using loc:")
print(conditioned_data_loc)

# Example 4: Selecting rows based on conditions using iloc
# (iloc needs a plain boolean array rather than a Series, hence .values)
conditioned_data_iloc = iris.iloc[(iris['sepal_width'] > 3.5).values]
print("\nRows based on condition using iloc:")
print(conditioned_data_iloc)


Rows based on condition using loc:
           sepal_length  sepal_width  petal_length  petal_width
species                                                        
setosa              5.0          3.6           1.4          0.2
setosa              5.4          3.9           1.7          0.4
setosa              5.4          3.7           1.5          0.2
setosa              5.8          4.0           1.2          0.2
setosa              5.7          4.4           1.5          0.4
setosa              5.4          3.9           1.3          0.4
setosa              5.7          3.8           1.7          0.3
setosa              5.1          3.8           1.5          0.3
setosa              5.1          3.7           1.5          0.4
setosa              4.6          3.6           1.0          0.2
setosa              5.2          4.1           1.5          0.1
setosa              5.5          4.2           1.4          0.2
setosa              4.9          3.6           1.4          0.1
...
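To see why Example 4 converts the condition with `.values`, here is a small sketch on a made-up frame (the three rows are illustrative, not taken from the iris data). It assumes a reasonably recent pandas version, where `iloc` rejects a boolean Series but accepts a plain boolean array.

```python
import pandas as pd

# Hypothetical toy frame with a label index, similar in shape to the
# species-indexed iris data used above.
df = pd.DataFrame(
    {"sepal_width": [3.6, 3.0, 3.9]},
    index=["setosa", "versicolor", "setosa"],
)
mask = df["sepal_width"] > 3.5  # boolean Series that carries the index

rows_loc = df.loc[mask]                # loc accepts the boolean Series
rows_iloc = df.iloc[mask.to_numpy()]   # iloc takes a plain boolean array

# Passing the Series itself to iloc is expected to be rejected.
rejected = False
try:
    df.iloc[mask]
except (ValueError, NotImplementedError) as err:
    rejected = True
    print(type(err).__name__, err)

print(rows_loc.equals(rows_iloc))
```

`.to_numpy()` is used here in place of `.values`; both return the underlying array for this case.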

## 2. The difference between List comprehensions and Loops

### There are some key differences:

* Readability and Conciseness: List comprehensions are more concise and readable for simple filtering and transformation operations.

* Performance: List comprehensions can be slightly more efficient for simple operations because they are optimized in Python. The difference matters most for large datasets.

* Flexibility: Loops offer more flexibility when dealing with complex data transformations, such as those that need several statements or branches per item.

* Maintainability: List comprehensions are often preferred for their simplicity, which makes simple code easier to maintain and debug.
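The performance point can be measured rather than taken on faith. The sketch below uses an arbitrary input size and hypothetical function names; it first confirms that both versions give the same result and then times them with `timeit`. The numbers will differ between machines and Python versions.

```python
import timeit

numbers = list(range(100_000))  # illustrative input size


def squares_with_loop(values):
    """Square the even values using an explicit loop and append()."""
    result = []
    for value in values:
        if value % 2 == 0:
            result.append(value * value)
    return result


def squares_with_comprehension(values):
    """Same filter and transformation as a list comprehension."""
    return [value * value for value in values if value % 2 == 0]


# Both versions should build exactly the same list.
same = squares_with_loop(numbers) == squares_with_comprehension(numbers)
print(same)

loop_time = timeit.timeit(lambda: squares_with_loop(numbers), number=20)
comp_time = timeit.timeit(lambda: squares_with_comprehension(numbers), number=20)
print(f"loop: {loop_time:.3f} s, comprehension: {comp_time:.3f} s")
```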

## Using Loops 

In [5]:
# Sample sales data
sales_data = [120, 80, 150, 90, 200, 110, 130, 95, 160]

# Define a threshold
threshold = 100

# Filter and calculate total revenue using a for loop
filtered_sales = []
for sale in sales_data:
    if sale >= threshold:
        filtered_sales.append(sale)

total_revenue = sum(filtered_sales)
total_revenue

870

## Using List Comprehension 

In [6]:
# Sample sales data
sales_data = [120, 80, 150, 90, 200, 110, 130, 95, 160]

# Define a threshold
threshold = 100

# Filter and calculate total revenue using list comprehension
filtered_sales = [sale for sale in sales_data if sale >= threshold]
total_revenue = sum(filtered_sales)
total_revenue

870
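When only the total is needed, a generator expression is a closely related option that was not part of the examples above: it has the same syntax as the comprehension but feeds values to `sum()` one at a time, without building the intermediate `filtered_sales` list. A minimal sketch with the same sample data:

```python
# Same sample sales data and threshold as in the cells above.
sales_data = [120, 80, 150, 90, 200, 110, 130, 95, 160]
threshold = 100

# Generator expression: no intermediate list is created.
total_revenue = sum(sale for sale in sales_data if sale >= threshold)
print(total_revenue)  # 870
```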

##  3. Filtering and Grouping 

In [30]:
# Keep only the rows where sepal_length > 5.5 and petal_length > 1.5
filtered_data = iris[(iris['sepal_length'] > 5.5) & (iris['petal_length'] > 1.5)]

print(filtered_data)

# Print the unique species (stored in the index) that meet both filter conditions
print('Species with sepal_length greater than 5.5:', filtered_data.index.unique())


            sepal_length  sepal_width  petal_length  petal_width
species                                                         
setosa               5.7          3.8           1.7          0.3
versicolor           7.0          3.2           4.7          1.4
versicolor           6.4          3.2           4.5          1.5
versicolor           6.9          3.1           4.9          1.5
versicolor           6.5          2.8           4.6          1.5
...                  ...          ...           ...          ...
virginica            6.7          3.0           5.2          2.3
virginica            6.3          2.5           5.0          1.9
virginica            6.5          3.0           5.2          2.0
virginica            6.2          3.4           5.4          2.3
virginica            5.9          3.0           5.1          1.8

[89 rows x 4 columns]
Species with sepal_length greater than 5.5: Index(['setosa', 'versicolor', 'virginica'], dtype='object', name='species')
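The filter above combines two conditions with `&`, and each condition needs its own parentheses because `&` binds more tightly than `>`. The sketch below repeats the pattern on a small made-up frame (the four rows are illustrative, not iris measurements) and shows `DataFrame.query` as an equivalent way to write the same filter.

```python
import pandas as pd

# Hypothetical toy data with the species stored in the index.
flowers = pd.DataFrame(
    {
        "sepal_length": [5.1, 5.7, 7.0, 6.3],
        "petal_length": [1.4, 1.7, 4.7, 1.2],
    },
    index=pd.Index(
        ["setosa", "setosa", "versicolor", "virginica"], name="species"
    ),
)

# Boolean masks combined with & (each comparison in parentheses).
mask = (flowers["sepal_length"] > 5.5) & (flowers["petal_length"] > 1.5)
filtered = flowers[mask]

# The same filter written as a query string.
filtered_query = flowers.query("sepal_length > 5.5 and petal_length > 1.5")

print(filtered.index.unique().tolist())  # ['setosa', 'versicolor']
print(filtered.equals(filtered_query))   # True
```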


In [20]:
# Group the data by species and calculate the mean for each species
grouped_iris = iris.groupby('species').mean()

# Select specific columns from the grouped data
selected_columns = ['sepal_length', 'petal_length']
subset = grouped_iris[selected_columns]

# Print the subset
print("Subset of mean values for 'sepal_length' and 'petal_length':")
print(subset)


Subset of mean values for 'sepal_length' and 'petal_length':
            sepal_length  petal_length
species                               
setosa             5.006         1.462
versicolor         5.936         4.260
virginica          6.588         5.552
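As a variation on the grouping cell, the columns can be selected before aggregating, and `agg` with named aggregations can compute different statistics per column. This is a sketch on a made-up four-row frame; the output column names `mean_sepal` and `max_petal` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; here species is a regular column.
flowers = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "versicolor", "versicolor"],
        "sepal_length": [5.0, 5.2, 6.0, 6.4],
        "petal_length": [1.4, 1.6, 4.0, 4.4],
    }
)

# Select the columns first, then take the mean of each group.
means = flowers.groupby("species")[["sepal_length", "petal_length"]].mean()

# Named aggregation: a different statistic for each output column.
summary = flowers.groupby("species").agg(
    mean_sepal=("sepal_length", "mean"),
    max_petal=("petal_length", "max"),
)

print(means)
print(summary)
```

Selecting the columns before calling `.mean()` avoids aggregating columns that are not needed.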
