# Most Frequently Ordered Items

Description

As a data analyst at PacFood, a food and beverage company, you are responsible for analyzing customer orders to determine the most frequently ordered items. Your goal is to process customer orders and create a summary of the top items ordered.

Each order is represented as a string in a list, where each string contains the name of the ordered item. You need to gather this information and identify the most popular items based on the number of orders.

Instructions:

- You will receive a list of customer orders as input.

- Use a loop (e.g., for loop) to iterate through the list of orders and count the occurrences of each item ordered.

- Maintain a dictionary to count the occurrences of each item ordered.

- After processing all orders, determine the item with the highest count and store the results in a dictionary called top_ordered_item, containing:

    - The item name as the key.

    - The number of times it was ordered as the value.

In [None]:
# Expected Output

# {'Pizza': 4}

In [5]:
# Don't change code below
orders = [
    "Pizza",
    "Pasta",
    "Salad",
    "Pizza",
    "Soda",
    "Pizza",
    "Fries",
    "Salad",
    "Pizza"
]

# Your code here
item_counts = {}
for item in orders:
    # Start a new count for unseen items, otherwise increment the existing one
    if item not in item_counts:
        item_counts[item] = 1
    else:
        item_counts[item] += 1

# The key with the largest count is the most frequently ordered item
most_ordered_item = max(item_counts, key=item_counts.get)
top_ordered_item = {most_ordered_item: item_counts[most_ordered_item]}
print(top_ordered_item)

{'Pizza': 4}
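
The counting loop can also be expressed with `collections.Counter` from the standard library. This is a minimal sketch rather than the solution the exercise asks for (the instructions call for an explicit loop); the `orders` list is repeated so the block runs on its own, and `counts` is a name introduced here for illustration.

```python
from collections import Counter

orders = ["Pizza", "Pasta", "Salad", "Pizza", "Soda",
          "Pizza", "Fries", "Salad", "Pizza"]

# Counter builds the item -> count mapping in a single pass
counts = Counter(orders)

# most_common(1) returns a list holding one (item, count) pair
item, count = counts.most_common(1)[0]
top_ordered_item = {item: count}
print(top_ordered_item)
```

If several items tie for the highest count, both this sketch and the `max(...)` approach report only one of them.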


# Delivery ETA Calculation

Description

You work as a data analyst at PacFood, a company focused on optimizing food delivery times. Accurate ETA (Estimated Time of Arrival) calculations are critical in managing customer expectations and satisfaction. In this task, your goal is to compute the ETA for multiple food deliveries based on the following time components:

- The time it takes for the driver to reach the restaurant.

- The time it takes for the restaurant to prepare the food.

- The time it takes for the driver to deliver the food to the customer.

Since the restaurant can start cooking while the driver is still en route, you need to adjust the total ETA to account for this overlap.

Instructions:

- A nested list will be given as input representing the times involved for each delivery:

    - The first element represents the time it takes for the driver to reach the restaurant.

    - The second element represents the time the restaurant takes to cook the food.

    - The third element represents the time it takes for the driver to deliver the food to the customer.

- Extract the time components from each nested list: driver-to-restaurant time, cooking time, and delivery time.

- Calculate the ETA for each delivery:

    - Add the driver’s time to reach the restaurant and the driver’s time to deliver the food to the customer.

    - If the restaurant’s cooking time exceeds the driver’s time to reach the restaurant, add the remaining cooking time to the ETA.

- Store the result for all deliveries in a list called estimated_eta.

In [7]:
# Expected Output

# [35, 49, 43]

In [8]:
# Don't change code below
delivery_times = [[7, 22, 13], [33, 24, 16], [11, 17, 26]]

# Your code here
estimated_eta = []
for i in delivery_times:
    driver_to_restaurant = i[0]
    cooking_time = i[1]
    delivery_time = i[2]
    if cooking_time > driver_to_restaurant:
        exceed_time = cooking_time - driver_to_restaurant
        eta = driver_to_restaurant + exceed_time + delivery_time
    else:
        eta = driver_to_restaurant + delivery_time
    
    estimated_eta.append(eta)
        
print(estimated_eta)

[35, 49, 43]
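
Because cooking overlaps with the drive to the restaurant, the pickup happens at whichever of the two finishes later. Under that reading, the if/else rule is equivalent to taking a maximum, as this short sketch suggests (the loop variable names are chosen here for illustration):

```python
delivery_times = [[7, 22, 13], [33, 24, 16], [11, 17, 26]]

estimated_eta = []
for to_restaurant, cooking, to_customer in delivery_times:
    # The food leaves the restaurant once the driver has arrived
    # and the food is ready, whichever happens later
    pickup_time = max(to_restaurant, cooking)
    estimated_eta.append(pickup_time + to_customer)

print(estimated_eta)
```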


# Monthly Churn Rate Calculation

Description

You work as a data analyst at PacFlix, a video streaming service with millions of customers. Understanding churn rate, the percentage of customers who stop using the service, is critical for maintaining customer retention.

In this task, your goal is to calculate the monthly churn rate for PacFlix based on customer data.

Instructions

- Two lists, customer_in and customer_out, with one value per month, and a starting count, initial_total_customer, will be provided:

    - initial_total_customer represents the number of total customers at the beginning of the period.

    - customer_in represents the number of new customers who joined in that month.

    - customer_out represents the number of customers who left the service (churned) during that month.

- Calculate the churn rate for each month using the formula:

$$
\text{Churn Rate} = \frac{C_{Left}}{C_{Start} + C_{New}}
$$

Where:

$C_{Left}$ : Number of customers who left

$C_{Start}$ : Total customers at the start of the period

$C_{New}$ : New customers acquired during the period

- Update the total customer count at the beginning of each period by including new customers and excluding churned customers from the previous period.

- Store the churn rate for each month in a list called churn_rates. Round each churn rate to 4 decimal places before storing it.
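
As a worked check of the formula and the update rule, here are the first two months computed by hand from the sample inputs in the code cell below (1000 initial customers; 100 and 350 joining; 24 and 11 leaving):

$$
\begin{aligned}
\text{Month 1: } & \frac{24}{1000 + 100} = \frac{24}{1100} \approx 0.0218 \\
C_{Start} \text{ for month 2: } & 1000 + 100 - 24 = 1076 \\
\text{Month 2: } & \frac{11}{1076 + 350} = \frac{11}{1426} \approx 0.0077
\end{aligned}
$$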

In [None]:
# Expected Output

# [0.0218, 0.0077, 0.0187, 0.0124, 0.0141]

In [9]:
# Don't change code below
customer_in = [100, 350, 300, 500, 320]
customer_out = [24, 11, 32, 27, 35]
initial_total_customer = 1000

# Your code here
churn_rates = []
c_start = initial_total_customer
for i in range(len(customer_in)):
    c_new = customer_in[i]
    c_left = customer_out[i]
    churn_rate = round(c_left / (c_start + c_new), 4)
    churn_rates.append(churn_rate)
    c_start += c_new - c_left
    
print(churn_rates)

[0.0218, 0.0077, 0.0187, 0.0124, 0.0141]


# Loan Default Prediction Model Evaluation

Description

You work as a data scientist at PacHome, a company that offers loan products to millions of customers. One of your key responsibilities is to reduce the risk of approving loans to customers who are likely to default. To achieve this, several predictive models are built to estimate the likelihood of a customer defaulting on their loan. Your task is to evaluate each model using the recall metric and identify the best-performing model.

The recall metric is particularly important in this context because it measures how well the model identifies all actual loan defaulters. A high recall score means the model minimizes the risk of approving loans to high-risk customers.

Instructions

- A dictionary will be provided as input containing the following keys:

    - 'True Label': A list representing whether a customer defaulted or not.

    - 'Model 1': A list representing the predicted labels for Model 1.

    - 'Model 2': A list representing the predicted labels for Model 2.

- Each list contains binary values:

    - 1 represents a customer who defaulted.

    - 0 represents a customer who did not default.

- For each model, calculate the recall using the formula:

    Recall = ( True Positives / (True Positives + False Negatives) )
 
    - True Positives (TP): The number of customers correctly predicted to default.

    - False Negatives (FN): The number of customers who defaulted but were predicted not to default.

- Store the recall scores for each model in a dictionary called recall_scores, where the keys are "Model <number>" and the values are the calculated recall scores.

- Identify the best model by selecting the one with the highest recall score and store the result in best_model.

- Print the best model in the format:

    Best model {best_model} with recall score: {recall_scores[best_model]}

In [None]:
# Expected Output:

# Best model Model 2 with recall score: 0.8

In [10]:
# Don't change code below
input_data = {
    'True Label': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    'Model 1': [1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
    'Model 2': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
}

# Your code here
recall_scores = {}
for model in ['Model 1', 'Model 2']:
    true_positives = 0
    false_negatives = 0
    for true_label, predicted in zip(input_data['True Label'], input_data[model]):
        # Only actual defaulters (true label 1) count toward recall
        if true_label == 1 and predicted == 1:
            true_positives += 1
        elif true_label == 1 and predicted == 0:
            false_negatives += 1

    # Compute recall once per model, after all predictions are counted
    recall = true_positives / (true_positives + false_negatives)
    recall_scores[model] = round(recall, 2)

best_model = max(recall_scores, key=recall_scores.get)
print(f'Best model {best_model} with recall score: {recall_scores[best_model]}')

Best model Model 2 with recall score: 0.8
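
Recall only considers customers whose true label is 1, so correct predictions of 0 (true negatives) must not be counted. The sketch below wraps that rule in a small helper; the function name `recall` and the fallback of 0.0 when there are no actual defaulters are assumptions made for this illustration, not part of the exercise.

```python
def recall(true_labels, predicted_labels):
    """Return TP / (TP + FN), or 0.0 if there are no actual positives."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positive: defaulted and predicted to default
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    # False negative: defaulted but predicted not to default
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


true_labels = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]
model_1 = [1, 0, 0, 0, 0, 1, 0, 1, 0, 1]
model_2 = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

# Model 1 finds 3 of the 5 defaulters, Model 2 finds 4 of 5
print(recall(true_labels, model_1), recall(true_labels, model_2))
```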


# Stopwords Removal from News Articles

Description

As a data engineer at PacMedia, you are responsible for processing a list of digital news articles in Indonesian by removing stopwords to prepare the data for further analysis. Your goal is to create a clean list of articles without these common words that do not contribute much meaning.

Instructions

- You will receive a list of strings, where each string represents a news article in Indonesian.

- You will also be given a list of stopwords that need to be removed from the articles.

- Use a loop to iterate through each article in the list.

- For each article, split the text into individual words, check each word against the stopwords list, and create a new list containing only the words that are not stopwords.

- After processing all articles, join the remaining words back into a single string for each article.

- Store the final cleaned list of articles in a variable called cleaned_articles.

In [3]:
# Expected Output

# ['Perubahan iklim mempengaruhi pola cuaca secara global.', 
# 'Smartphone baru memiliki teknologi desain canggih.', 
# 'Para ilmuwan menemukan spesies baru hutan hujan Amazon.']

In [4]:
# Don't change code below
articles = [
    "Perubahan iklim itu mempengaruhi pola cuaca secara global.",
    "Smartphone baru memiliki sebuah teknologi dan desain canggih.",
    "Para ilmuwan menemukan spesies baru di hutan hujan Amazon."
]

stopwords = [
    "yang", "adalah", "dari", "di", "ke", "dan", "dapat", "itu", "sebuah"
]

# Your code here
cleaned_articles = []
for i in articles:
    articles_split = i.split()
    articles_list = []
    for j in articles_split:
        if j not in stopwords:
            articles_list.append(j)
        
    article = " ".join(articles_list)
    cleaned_articles.append(article)
            
print(cleaned_articles)

['Perubahan iklim mempengaruhi pola cuaca secara global.', 'Smartphone baru memiliki teknologi desain canggih.', 'Para ilmuwan menemukan spesies baru hutan hujan Amazon.']
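
The same filtering can be written more compactly. The sketch below converts the stopword list to a set, which makes each membership check a constant-time lookup, and uses a comprehension in place of the nested loops; it is an alternative illustration, not a replacement for the loop the instructions ask for.

```python
articles = [
    "Perubahan iklim itu mempengaruhi pola cuaca secara global.",
    "Smartphone baru memiliki sebuah teknologi dan desain canggih.",
    "Para ilmuwan menemukan spesies baru di hutan hujan Amazon.",
]

# A set gives fast membership checks compared with scanning a list
stopwords = {"yang", "adalah", "dari", "di", "ke", "dan", "dapat", "itu", "sebuah"}

cleaned_articles = [
    " ".join(word for word in article.split() if word not in stopwords)
    for article in articles
]
print(cleaned_articles)
```

Note that the comparison is an exact, case-sensitive match on whitespace-separated tokens, so a stopword with attached punctuation (for example "itu.") or different capitalization ("Itu") would not be removed by either version.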


# E-commerce Data Imputation

Description

As a data engineer at PacCart, an e-commerce platform specializing in online shopping, you are responsible for ensuring the quality and integrity of the product data. One of the common challenges in managing product information is handling missing values, which can hinder the analysis of product performance. Your goal is to perform imputation on a given dataset containing missing values to create a complete dataset for analysis.

Instructions

- You will receive a list of dictionaries representing the product dataset, where each dictionary contains the following fields:

    - product_id: A unique identifier for each product.
    
    - price: The price of the product (can be None indicating a missing value).
    
    - stock: The number of items in stock (can also be None indicating a missing value). This should be an integer.

    - rating: A numerical rating of the product (should never be missing).

- Use the following imputation strategy:

    - If price is missing, replace it with the average of the non-missing values of price.

    - If stock is missing, replace it with the average of the non-missing values of stock.

    - rating should remain unchanged since it should never be missing.

- Calculate the average of price (float with 2 decimal places) and average of stock (integer), while ignoring missing values.

- Use a loop to iterate through each product in the dataset:

    - Check for missing values and apply the appropriate imputation.

- Store the clean dataset in a list of dictionaries called completed_data.

In [2]:
# Expected Output

# [{'product_id': 105, 'price': 43.99, 'stock': 225, 'rating': 4.5}, 
# {'product_id': 106, 'price': 28.16, 'stock': 225, 'rating': 4.2}, 
# {'product_id': 107, 'price': 28.16, 'stock': 150, 'rating': 4.7}, 
# {'product_id': 108, 'price': 12.34, 'stock': 300, 'rating': 4.4}]

In [1]:
# Don't change code below
data = [
    {'product_id': 105, 'price': 43.99, 'stock': None, 'rating': 4.5},
    {'product_id': 106, 'price': None, 'stock': None, 'rating': 4.2},
    {'product_id': 107, 'price': None, 'stock': 150, 'rating': 4.7},
    {'product_id': 108, 'price': 12.34, 'stock': 300, 'rating': 4.4}
]

# Your code here
# Sum and count the non-missing values of each field
price_total, price_count = 0, 0
stock_total, stock_count = 0, 0
for product in data:
    if product['price'] is not None:
        price_total += product['price']
        price_count += 1
    if product['stock'] is not None:
        stock_total += product['stock']
        stock_count += 1

average_price = round(price_total / price_count, 2)
average_stock = int(stock_total / stock_count)

completed_data = []
for product in data:
    # Copy each record so the original data is left untouched
    completed = dict(product)
    if completed['price'] is None:
        completed['price'] = average_price
    if completed['stock'] is None:
        completed['stock'] = average_stock
    completed_data.append(completed)

print(completed_data)

[{'product_id': 105, 'price': 43.99, 'stock': 225, 'rating': 4.5}, {'product_id': 106, 'price': 28.16, 'stock': 225, 'rating': 4.2}, {'product_id': 107, 'price': 28.16, 'stock': 150, 'rating': 4.7}, {'product_id': 108, 'price': 12.34, 'stock': 300, 'rating': 4.4}]


# Price Change Analysis

Task Description

You work as a data scientist at PacFin, a company specializing in financial analytics. Your task is to analyze stock prices to identify days with significant price changes, which could indicate market movements. Significant price changes are defined as daily price changes that are greater than a specified percentage.

You need to create a function that calculates the percentage change in stock prices from one day to the next, then identifies the days where the price change exceeds a specified threshold.

Instructions

- Define a function identify_price_changes that takes two inputs:

    - A list of stock prices (numerical values representing daily closing prices).

    - A percentage threshold (a numerical value that represents the threshold for identifying significant price changes).

- For each day in the list (starting from the second day), calculate the percentage change from the previous day's closing price using the formula:

    Percentage Change = ( (Current Price − Previous Price) / Previous Price ) × 100

- Identify days where the absolute percentage change exceeds the specified threshold.

- Return a list of strings representing the days where significant price changes occurred in the format "Day X", where X is the day number (starting from Day 2 for the first change).

In [1]:
# Expected Output:

# ['Day 3', 'Day 4']

In [4]:
# Your code here
def identify_price_changes(stock_prices, threshold):
    day_changes = []
    for i in range(1, len(stock_prices)):
        current_price = stock_prices[i]
        previous_price = stock_prices[i-1]
        percentage_change = ((current_price - previous_price) / previous_price) * 100
        if abs(percentage_change) > threshold:
            day_changes.append(f'Day {i+1}')
        
    return day_changes
    
stock_prices = [100, 105, 98, 120, 115, 112]
threshold = 5 
result = identify_price_changes(stock_prices, threshold)
print(result)

['Day 3', 'Day 4']
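
To see why only Day 3 and Day 4 are reported, it can help to print every daily percentage change for the sample prices. The sketch below does that; the `changes` dictionary and the rounding to two decimals are choices made here for readability.

```python
stock_prices = [100, 105, 98, 120, 115, 112]

changes = {}
# Pair each price with the next one; the first change belongs to Day 2
for day, (previous, current) in enumerate(zip(stock_prices, stock_prices[1:]), start=2):
    changes[f"Day {day}"] = round((current - previous) / previous * 100, 2)

print(changes)
```

Day 2 changes by exactly 5%, which does not exceed a threshold of 5, so it is excluded by the strict `>` comparison.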


# Search Ranking

Description

You are a data analyst at PacNews, where you are tasked with analyzing articles to identify the most popular words based on their frequency. Your goal is to create a function that takes the article text (as a single string) and an integer N as input and returns the top N most frequently occurring words along with their counts. If multiple words have the same frequency, they should be sorted lexicographically in ascending order, which is the ordering that produces the expected output below.

Instructions

- Define a function top_popular_words that accepts two parameters:

    - A string article, which contains the text of the article.

    - An integer N (default value of 3) representing the number of top words to return.

- Process the input to:

    - Convert the article text to lowercase to ensure case insensitivity.

    - Split the article into individual words, removing any punctuation (. , ! ? ; : " - _).

- Count the frequency of each word in the article.

- Sort the words first by their frequency in descending order, and then lexicographically in ascending order if frequencies are the same.

- Return a list of tuples, where each tuple contains the word and its frequency count, for the top N words based on the sorting criteria.

In [None]:
# Expected Output:

# [('data', 4), ('is', 2), ('decision', 1)]

In [6]:
# Your code here
def top_popular_words(article, N=3):
    # Lowercase, then turn every listed punctuation mark into a space so that
    # hyphenated words such as "data-driven" split into separate words
    punctuation = '.,!?;:"-_'
    cleaned = article.lower().translate(str.maketrans(punctuation, ' ' * len(punctuation)))

    word_counts = {}
    for word in cleaned.split():
        word_counts[word] = word_counts.get(word, 0) + 1

    # Sort by count (descending), then alphabetically (ascending) to break ties
    ranked = sorted(word_counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:N]
    
article = 'In data science, data is the new oil. Data is essential for data-driven decision-making.'
result = top_popular_words(article, 3)
print(result)

[('data', 4), ('is', 2), ('decision', 1)]
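
For comparison, here is a sketch of the same ranking built on `collections.Counter` and a regular-expression split. The function name `top_words_counter` is hypothetical, and the character class simply mirrors the punctuation listed in the instructions.

```python
import re
from collections import Counter


def top_words_counter(article, N=3):
    # Split on runs of whitespace or any of the listed punctuation marks
    words = re.split(r'[\s.,!?;:"\-_]+', article.lower())
    # Drop empty strings left at the edges of the split
    counts = Counter(word for word in words if word)
    # Negating the count sorts by frequency descending; the word breaks ties
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:N]


article = 'In data science, data is the new oil. Data is essential for data-driven decision-making.'
print(top_words_counter(article, 3))
```

`Counter.most_common` is not used for the final ranking because it orders ties by first occurrence rather than alphabetically.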


# Clean Email Addresses

Description

As a data analyst at PacMail, an email marketing company, you are responsible for ensuring the accuracy and validity of customer email addresses. Some email addresses in your dataset have extra spaces, invalid formats, or inconsistent capitalization. Your task is to create a function that cleans these email addresses by removing unnecessary whitespace and converting them to lowercase, and that discards addresses with an invalid format.

Instructions

- Define a function clean_email_addresses that takes a list of email addresses as input.

- For each email in the list:

    - Remove any leading or trailing whitespace.

    - Convert the email address to lowercase.

    - Check if the email contains an "@" symbol to ensure it is valid.

    - Ensure that the domain part (after the "@") follows a valid format:

        - The domain name must contain only letters (domains with numbers are invalid).

        - The top-level domain (TLD) should be a valid one like .com, .org, or .net.
    
- Return a list of cleaned email addresses, sorted in ascending order.

In [None]:
# Expected Output:

# ['alice@goalmail.com', 'bob@polyson.com', 'frizt@live.org']

In [None]:
# Your code here
def clean_email_addresses(emails):
    valid_tlds = ('.com', '.org', '.net')
    cleaned = []

    for email in emails:
        email = email.strip().lower()

        # A valid address has exactly one "@" with text on both sides
        if email.count('@') != 1:
            continue
        local_part, domain_part = email.split('@')
        if not local_part or not domain_part:
            continue

        # The domain must end with one of the accepted TLDs
        if not domain_part.endswith(valid_tlds):
            continue

        # The domain name (everything before the TLD) must contain only letters
        domain_name = domain_part.rsplit('.', 1)[0]
        if not domain_name.isalpha():
            continue

        cleaned.append(email)

    # The instructions ask for the result in ascending order
    return sorted(cleaned)


emails = [
    '   Alice@goalmail.com      ',
    'bob@polyson.com',
    '  CHARLIE@mail',
    'dave@@goalmail.net',
    '  erliza@3dimension.com ',
    'frizt@live.org  ',
    'george@goalmail.connect'
]
result = clean_email_addresses(emails)
print(result)


['alice@goalmail.com', 'bob@polyson.com', 'frizt@live.org']
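
The validation rules can also be captured in one regular expression. The following is a sketch under the exercise's simplified rules only (a letters-only domain name and three accepted TLDs); it is not a general-purpose email validator, and `EMAIL_PATTERN` and `clean_email_addresses_regex` are names introduced here for illustration.

```python
import re

# local part without "@" or whitespace, "@", letters-only domain, accepted TLD
EMAIL_PATTERN = re.compile(r'[^@\s]+@[a-z]+\.(com|org|net)')


def clean_email_addresses_regex(emails):
    normalized = (email.strip().lower() for email in emails)
    # fullmatch requires the whole string to fit the pattern
    return sorted(email for email in normalized if EMAIL_PATTERN.fullmatch(email))


emails = [
    '   Alice@goalmail.com      ',
    'bob@polyson.com',
    '  CHARLIE@mail',
    'dave@@goalmail.net',
    '  erliza@3dimension.com ',
    'frizt@live.org  ',
    'george@goalmail.connect',
]
print(clean_email_addresses_regex(emails))
```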


# E-Tilang Vehicle Plate 

Description

You work in the IT division of the police department and need to develop an e-tilang (electronic traffic ticketing) program. This program will analyze vehicle license plate numbers to determine whether the last digit of each plate's serial number is odd or even. This classification will assist in managing traffic regulations.

Your task is to create a function named plate_extract that accepts a list of vehicle license plate numbers and returns a list of results indicating if each plate is valid and if its serial number is odd or even.

Instructions

- Define a function plate_extract that takes a list of vehicle license plate numbers as input.

- Return an empty list if the input list is empty.

- For each license plate:

    - Validate the format:

        - Start with 1 or 2 letters (region code).

        - Followed by exactly 4 digits (serial number).

        - End with whitespace and 1 or 2 letters (letter combination).

    - If valid, extract the last digit of the serial number:

        - Return "Even" if it’s even (0, 2, 4, 6, 8).

        - Return "Odd" if it’s odd (1, 3, 5, 7, 9).

    - If invalid, return "Invalid Plate".

- Return the list of results, with one entry for each plate.

In [None]:
# Expected Output - 1:
# ['Even', 'Invalid Plate', 'Invalid Plate', 'Odd']

# Expected Output - 2:
# ['Even', 'Even', 'Even', 'Even']

In [None]:
# Your code here
def plate_extract(plate_numbers):
    output = []
    for plate in plate_numbers:
        parts = plate.split()
        # A valid plate has exactly three parts: region code, serial number, letters
        if len(parts) != 3:
            output.append('Invalid Plate')
            continue

        region, serial, suffix = parts
        is_valid = (
            1 <= len(region) <= 2 and region.isalpha() and region.isupper()
            and len(serial) == 4 and serial.isdecimal()
            and 1 <= len(suffix) <= 2 and suffix.isalpha() and suffix.isupper()
        )
        if not is_valid:
            output.append('Invalid Plate')
            continue

        # Classify by the last digit of the 4-digit serial number
        last_digit = int(serial[-1])
        output.append('Even' if last_digit % 2 == 0 else 'Odd')

    return output
    
plate_numbers = ['B 1230 XY', 'DKS 1234 S', 'D 123', 'BA 4567 T']
# plate_numbers = ['D 5678 AB', 'FG 3456 GH', 'H 7890 J', 'B 1234 ZY']
result = plate_extract(plate_numbers)
print(result)

['Even', 'Invalid Plate', 'Invalid Plate', 'Odd']
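
The format rules map naturally onto a regular expression. The sketch below assumes, based on the sample plates, that the three parts are separated by whitespace and that letters are uppercase; `PLATE_PATTERN` and `plate_extract_regex` are hypothetical names used only for this illustration.

```python
import re

# 1-2 uppercase letters, whitespace, exactly 4 digits, whitespace, 1-2 uppercase letters
PLATE_PATTERN = re.compile(r'[A-Z]{1,2}\s+([0-9]{4})\s+[A-Z]{1,2}')


def plate_extract_regex(plate_numbers):
    results = []
    for plate in plate_numbers:
        match = PLATE_PATTERN.fullmatch(plate.strip())
        if match is None:
            results.append('Invalid Plate')
            continue
        # Group 1 holds the 4-digit serial number
        last_digit = int(match.group(1)[-1])
        results.append('Even' if last_digit % 2 == 0 else 'Odd')
    return results


print(plate_extract_regex(['B 1230 XY', 'DKS 1234 S', 'D 123', 'BA 4567 T']))
print(plate_extract_regex(['D 5678 AB', 'FG 3456 GH', 'H 7890 J', 'B 1234 ZY']))
```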
