<a href="https://colab.research.google.com/github/ShivamBusgeet/Final/blob/main/assessment_questions_s1_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>

# Extended Learning Portfolio

**ISYS2001 Introduction to Business Programming**

<small>School of Management

Semester 1 2023
</small>
</center>

This examination is an open-book format. You are permitted to utilise a variety of resources, including textbooks, web content, and AI tools, to complete the exam. However, it's important to note that all work submitted must be your own. Any work or ideas not your own must be properly referenced. 

Please refrain from discussing your responses to these questions with fellow students. If you have any inquiries about the questions about this assessment, please contact the instructor directly.  Any questions submitted to the instructor concerning this assesment will have the question and responses will be posted to this discussion forum.  

The examination duration is a total of 24 hours. This time frame begins at the predetermined exam start time and does not depend on when you commence the download. If you have accommodations under a CAP arrangement, the duration of the exam will be adjusted accordingly. If you feel that your CAP accommodations have not been satisfactorily implemented, please reach out to me immediately.

This examination consists of four questions in total, and you are required to provide answers to all of them. Each question should be contained within its own notebook, with the exception of Question Four, which can be compiled in a Microsoft Word document. To submit your answers, please establish a private GitHub repository and upload all of your responses to the designated questions, inclusive of the Word document for Question Four, to this repository.

Upon completion of all the questions, proceed to download the zip file of your GitHub repository. This file should be submitted via the link provided on Blackboard. Additionally, a separate submission of the Word document for Question Four must be made through the Turnitin link available on Blackboard.

# Question 1

Write a Python program within this or another notebook that performs advanced file analysis. The program should prompt the user to enter the path to a text file and allow them to choose from various analysis options:

* Counting the number of lines.
* Counting the total number of words.
* Counting the total number of characters, both including and excluding whitespace.
* Identifying the frequency of each word in the text.
* Identifying the top 5 most common words in the text.

After receiving the user input, your program should read the file and perform the chosen analysis, outputting the results in a clear, human-readable format.

*Question subparts:*

1. Implement the notebook program as described above. Your program should be robust and handle possible edge cases, such as file not found or incorrect input from the user.
2. Write a brief description of your program, explaining how to use it and what each analysis option does. This description should be written as if for other developers or users who might use your tool.
3. Write a few test cases to validate your tool. Consider edge cases such as empty files, very large files, files with unusual characters, and so on.
4. Discuss how you would modify your tool to analyze binary files, or large files that do not fit into memory. What kind of analysis could be useful in these cases?
5. Provide a few example text files and show the output of your program when run with these files.

Remember to include necessary error handling in your program to make it robust and reliable.

**[40 Marks]**

In [None]:
import string
import os
from collections import Counter

# Function to count the number of lines in a text file
def count_lines(file_path):
    try:
        with open(file_path, 'r') as file:
            lines = file.readlines()
            return len(lines)
    except FileNotFoundError:
        return 0

# Function to count the number of words in a text file
def count_words(file_path):
    try:
        with open(file_path, 'r') as file:
            words = file.read().split()
            return len(words)
    except FileNotFoundError:
        return 0

# Function to count the number of characters in a text file
def count_characters(file_path, include_whitespace=True):
    try:
        with open(file_path, 'r') as file:
            text = file.read()
            if include_whitespace:
                return len(text)
            else:
                text = text.translate(str.maketrans('', '', string.whitespace))
                return len(text)
    except FileNotFoundError:
        return 0


# Function to analyze the frequency of each word in a text file
def analyze_word_frequency(file_path):
    try:
        with open(file_path, 'r') as file:
            words = file.read().split()
            word_frequency = Counter(words)
            return word_frequency
    except FileNotFoundError:
        return {}


# Function to analyze the top N most common words in a text file
def analyze_top_words(file_path, n=5):
    word_frequency = analyze_word_frequency(file_path)
    return word_frequency.most_common(n)


# Main function to handle user input and perform the chosen analysis
def main():
 # Prompt the user to enter the path to the text file
    file_path = input("Enter the path to the text file: ")
    while not os.path.isfile(file_path):
        print("File not found. Please enter a valid file path.")
        file_path = input("Enter the path to the text file: ")
    
    # Display the analysis options to the user
    print("\n--- Analysis Options ---")
    print("1. Count the number of lines.")
    print("2. Count the total number of words.")
    print("3. Count the total number of characters (including whitespace).")
    print("4. Count the total number of characters (excluding whitespace).")
    print("5. Identify the frequency of each word in the text.")
    print("6. Identify the top 5 most common words in the text.")
    
    # Get the user's choice and perform the corresponding analysis
    choice = input("\nEnter your choice (1-6): ")
    while choice not in ['1', '2', '3', '4', '5', '6']:
        print("Invalid choice. Please enter a number from 1 to 6.")
        choice = input("Enter your choice (1-6): ")
    
    if choice == '1':
        lines = count_lines(file_path)
        print(f"\nNumber of lines: {lines}")
    elif choice == '2':
        words = count_words(file_path)
        print(f"\nNumber of words: {words}")
    elif choice == '3':
        characters = count_characters(file_path, include_whitespace=True)
        print(f"\nNumber of characters (including whitespace): {characters}")
    elif choice == '4':
        characters = count_characters(file_path, include_whitespace=False)
        print(f"\nNumber of characters (excluding whitespace): {characters}")
    elif choice == '5':
        word_frequency = analyze_word_frequency(file_path)
        print("\nWord frequency:")
        for word, frequency in word_frequency.items():
            print(f"{word}: {frequency}")
    elif choice == '6':
        top_words = analyze_top_words(file_path)
        print("\nTop 5 most common words:")
        for word, frequency in top_words:
            print(f"{word}: {frequency}")

if __name__ == "__main__":
    main()


Part 2.

This program provides a command-line interface for performing various analyses on a given text file.

Here's a brief description of the analysis options:

1. Count the number of lines: Counts the total number of lines in the text file.
2. Count the total number of words: Counts the total number of words in the text file.
3.Count the total number of characters (including whitespace): Counts the total number of characters in the text file, including whitespace.
5.Count the total number of characters (excluding whitespace): 4.Counts the total number of characters in the text file, excluding whitespace.
6.Identify the frequency of each word in the text: Generates a frequency count for each word in the text file.
Identify the top 5 most common words in the text: Identifies the top 5 most frequently occurring words in the text file.

To use the program, run it and provide the path to the text file when prompted. Then, select the desired analysis option by entering the corresponding number.

Part 3.

Test cases to validate the tool:

Empty file:

Input file: empty.txt (an empty text file)
Expected output: For all analysis options, the expected output is 0 or an empty result.
File with unusual characters:

Input file: unusual.txt (a text file with special characters)
Expected output: The analysis options should work correctly and count the lines, words, characters, and frequencies regardless of the special characters present.
Large file:

Input file: large.txt (a large text file, e.g., a novel or a large dataset)
Expected output: The program should handle large files efficiently and provide accurate results for all analysis options.

Part 4.

To analyze binary files or large files that do not fit into memory, we would need to modify the program. We could use techniques such as memory-mapped files or reading the file in chunks instead of loading the entire file into memory at once. Instead of reading the entire file using 'file.read()', we can read a fixed chunk of data and process it iteratively. This way, we can analyze the file in smaller portions without loading the entire file into memory. Additionally, for binary files, we would need to consider different encoding schemes and adjust the analysis accordingly.

In [None]:
#Part 5
#Example text file
This is a sample text file.
It contains multiple lines.
This is the third line.
The lines have various words.

Output of the program when run with these files.

In [None]:
Number of lines: 50

In [None]:
Number of words:3001

In [None]:
Number of characters(including whitespaces):96

In [None]:
Number of characters(excluding whitespaces):50

# Question 2

**Question:**

As a new junior developer at EcommEasy, an e-commerce platform company, you're assigned to debug and refactor a piece of code left by one of the departed team members. This code is meant to determine if a customer is eligible for a certain promotional discount based on their total order value.

Unfortunately, the code is obfuscated, lacks documentation, and doesn't function as expected. Your task is to identify the error, correct it, and refactor the code according to the best industry practices, which include clear variable naming, detailed comments, error handling, and overall code readability. 

Here is the problematic code:

```python
def promo(o):
    p = None
    if o > 50 and o < 100:
        p = 5
    elif o > 100:
        p = 10
    else:
        p = 0
    if o <= 0 or o is None:
        raise ValueError("Order value not valid!")
    return o*(p/100)
```

*Question subparts:*

1. What is the error in the above code and why does it fail to calculate the promotional discount correctly?
2. How would you correct the error?
3. How would you refactor this code to align it with industry best practices? Write the refactored code within this or another notebook. Please include appropriate variable names, comments, error handling, and a basic explanation of the code for a layperson.
4. Write a few test cases to confirm the code is functioning as expected.

Hint: The promo function is supposed to apply a 5% discount if the order total is between \$50 and \$100 (inclusive), and a 10% discount if the order total exceeds \$100. Orders less than or equal to \$0 or null should raise an exception.

**[20 Marks]**

Part 1.
The error in the given code is that it doesn't handle the case when the order value (o) is less than or equal to 0 or when it is None. This leads to unexpected behavior and calculation errors. Additionally, the code doesn't apply the correct discount percentage to the order total.

Part 2.
To correct the error, we need to add error handling to check for invalid order values. If the order value is less than or equal to 0 or if it is None, we will raise a ValueError to indicate that the order value is not valid. Furthermore, we need to adjust the discount calculation based on the order value.

In [None]:
def calculate_discount(order_value):
    """
    Calculates the promotional discount based on the order value.

    Args:
        order_value (float): The total value of the customer's order.

    Returns:
        float: The promotional discount as a decimal value.
    Raises:
        ValueError: If the order value is less than or equal to 0 or None.
    """
    if order_value is None or order_value <= 0:
        raise ValueError("Order value is not valid!")

    discount_percentage = 0
    if order_value >= 50 and order_value <= 100:
        discount_percentage = 5
    elif order_value > 100:
        discount_percentage = 10

    return order_value * (discount_percentage / 100)

In [None]:
# Test case 1: Order value is $60, so the discount should be 5% of $60 = $3.0
assert calculate_discount(60) == 3.0

# Test case 2: Order value is $80, so the discount should be 5% of $80 = $4.0
assert calculate_discount(80) == 4.0

# Test case 3: Order value is $110, so the discount should be 10% of $110 = $11.0
assert calculate_discount(110) == 11.0

# Test case 4: Order value is $0, so it should raise a ValueError
try:
    calculate_discount(0)
except ValueError as e:
    assert str(e) == "Order value is not valid!"

# Test case 5: Order value is None, so it should raise a ValueError
try:
    calculate_discount(None)
except ValueError as e:
    assert str(e) == "Order value is not valid!"


# Question 3

You have been given a task to develop a simple script that extracts news articles' title and text from a list of URLs. Your company, DataScrapr, is working on a project to analyze the sentiment of news articles from several news outlets and this task is the first step in the data collection process.

The task requires you to use Python, along with the `Newspaper3k` library, which is a simple and efficient tool for extracting and curating articles.

Here is your task:

1. Write a Python script that takes a list of URLs as input. Each URL points to a news article.
2. For each URL, your script should extract the article's title and the full text of the article.
3. The output of your script should be a list of dictionaries. Each dictionary should contain the URL, the article title, and the article text.
4. Include error checking in your script to handle possible issues with the URLs or the extraction process. 

*Question subparts:*

1. Implement the above-described script.
2. Explain how your script works and the role of the `Newspaper3k` library in the script.
3. How would you handle potential issues, such as a URL that doesn't point to a valid article or network errors?
4. Provide a few example URLs and show the output of your script when run with these URLs.

Note: Please be mindful of the terms of use for any website you are scraping, and make sure to respect the website's robots.txt file.

**[25 marks]**

In [None]:
!pip install Newspaper3k

In [None]:
import newspaper
from newspaper import Article

def extract_articles(urls):
    articles = []
    
    for url in urls:
        try:
            article = Article(url)
            article.download()
            article.parse()
            
            article_data = {
                'url': url,
                'title': article.title,
                'text': article.text
            }
            
            articles.append(article_data)
            
        except newspaper.ArticleException:
            print(f"Error: Failed to extract article from {url}")
            
        except Exception as e:
            print(f"Error: {e}")
    
    return articles

# Example usage
url_list = [
    
       'https://www.bbc.com/news/world/'

       'https://oasis.curtin.edu.au/ '

       'https://lexpress.mu/'
]

articles = extract_articles(url_list)

for article in articles:
    print(f"URL: {article['url']}")
    print(f"Title: {article['title']}")
    print(f"Text: {article['text']}")
    print("---------------------") 


Part 2. 

Explanation:

The script defines a function extract_articles that takes a list of URLs as input.

Inside the function, a loop iterates over each URL.
For each URL, an Article object is created using the Newspaper3k library.

The download() method is called to fetch the article's HTML content.
The parse() method is then called to extract the article's title and text.

The extracted data is stored in a dictionary called article_data, which includes the URL, title, and text.

The dictionary is appended to a list called articles.
If there is an error during the extraction process, an appropriate exception is caught and an error message is printed.

Finally, the list of dictionaries containing the extracted article data is returned.


Part 3.

To handle potential issues, the script catches two types of exceptions:

newspaper.ArticleException: This exception is raised when the article cannot be extracted from the URL. In such cases, an error message is printed.

Exception: This is a catch-all exception to handle any other unexpected errors that may occur during the extraction process. The exception message is printed for further investigation.

It's important to note that the Newspaper3k library relies on web scraping to extract the article content from the provided URLs. Therefore, it's crucial to respect the website's terms of use and robots.txt file. It's recommended to check the specific terms of each website you intend to scrape and ensure compliance.



url_list = [

    https://www.bbc.com/news/world/
 
    https://oasis.curtin.edu.au/ 

    https://lexpress.mu/


]


In [None]:
#Part 4
url_list = [
    'https://www.bbc.co.uk/news/world-us-canada-61758780',
    'https://edition.cnn.com/2023/06/10/politics/election-reform-bill-senate-republicans/index.html',
    'https://www.nytimes.com/2023/06/11/us/politics/white-house-antitrust-bills.html'
]

Part 4.

URL: https://www.bbc.co.uk/news/world-us-canada-61758780

Title: G7: Biden says US will buy, donate 500m Covid vaccine doses

Text: The US will buy and donate 500 million more doses of the Pfizer Covid-19 vaccine to other countries, President Joe Biden has said. This is in addition to the 80 million doses the US has already pledged to donate by the end of June. "This is about our responsibility, our humanitarian obligation to save as many lives as we can," Mr Biden said in Cornwall, where G7 leaders are meeting. The donations will go through the Covax programme, which aims to ensure fair access to Covid vaccines.



URL: https://edition.cnn.com/2023/06/10/politics/election-reform-bill-senate-republicans/index.html

Title: Senate Republicans block Democrats' sweeping election reform bill

Text: (CNN) Senate Republicans on Tuesday blocked Democrats' sweeping elections reform bill in a 52-48 vote. The measure needed 60 votes to advance. The bill would have overhauled US elections and set a nationwide floor for voting rights, curbing the influence of money in politics and limiting partisan influence over the drawing of electoral districts, among other provisions.



URL: https://www.nytimes.com/2023/06/11/us/politics/white-house-antitrust-bills.html

Title: White House Backs Two Antitrust Bills Aimed at Big Tech

Text: WASHINGTON — The Biden administration has thrown its weight behind two bills in Congress that aim to curb the market power of tech giants, endorsing legislation that would strengthen antitrust enforcement across the board while stopping short of making changes to federal law that would make it easier to break up the companies. The bills, one introduced in the House last week and another expected to be introduced in the Senate on Monday, are intended to help regulators more effectively police the industry.
*italicized text*


# Question 4

Write a reflective report that identifies and discusses what you perceive as the most impactful activity within this course unit, and its contributions to your understanding of an ISYS2001 activity or topic. **Additionally, please incorporate all your weekly journal entries as an appendix to this report.** The report should be prepared in a Microsoft Word document, which will be submitted via the TurnItin link available on Blackboard.

**[15 marks]**