
<center>

# Extended Learning Portfolio

**ISYS2001 Introduction to Business Programming**

<small>School of Management

Semester 1 2023
</small>
</center>

This examination is an open-book format. You are permitted to utilise a variety of resources, including textbooks, web content, and AI tools, to complete the exam. However, it's important to note that all work submitted must be your own. Any work or ideas not your own must be properly referenced.

Please refrain from discussing your responses to these questions with fellow students. If you have any questions about this assessment, please contact the instructor directly. Any question submitted to the instructor concerning this assessment will be posted, together with its response, to the discussion forum.

The examination duration is a total of 24 hours. This time frame begins at the predetermined exam start time and does not depend on when you commence the download. If you have accommodations under a CAP arrangement, the duration of the exam will be adjusted accordingly. If you feel that your CAP accommodations have not been satisfactorily implemented, please reach out to me immediately.

This examination consists of four questions in total, and you are required to provide answers to all of them. Each question should be contained within its own notebook, with the exception of Question Four, which can be compiled in a Microsoft Word document. To submit your answers, please establish a private GitHub repository and upload all of your responses to the designated questions, inclusive of the Word document for Question Four, to this repository.

Upon completion of all the questions, proceed to download the zip file of your GitHub repository. This file should be submitted via the link provided on Blackboard. Additionally, a separate submission of the Word document for Question Four must be made through the Turnitin link available on Blackboard.

# Question 1

Write a Python program within this or another notebook that performs advanced file analysis. The program should prompt the user to enter the path to a text file and allow them to choose from various analysis options:

* Counting the number of lines.
* Counting the total number of words.
* Counting the total number of characters, both including and excluding whitespace.
* Identifying the frequency of each word in the text.
* Identifying the top 5 most common words in the text.

After receiving the user input, your program should read the file and perform the chosen analysis, outputting the results in a clear, human-readable format.

*Question subparts:*

1. Implement the notebook program as described above. Your program should be robust and handle possible edge cases, such as file not found or incorrect input from the user.
2. Write a brief description of your program, explaining how to use it and what each analysis option does. This description should be written as if for other developers or users who might use your tool.
3. Write a few test cases to validate your tool. Consider edge cases such as empty files, very large files, files with unusual characters, and so on.
4. Discuss how you would modify your tool to analyze binary files, or large files that do not fit into memory. What kind of analysis could be useful in these cases?
5. Provide a few example text files and show the output of your program when run with these files.

Remember to include necessary error handling in your program to make it robust and reliable.

**[40 Marks]**

```python
import os
from collections import Counter


def read_text(file_path):
    """Return the whole file as a string, decoded as UTF-8."""
    # errors='replace' stops an undecodable byte from aborting the analysis.
    with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
        return file.read()


def count_lines(file_path):
    """Count the lines in the file."""
    with open(file_path, 'r', encoding='utf-8', errors='replace') as file:
        return sum(1 for _ in file)


def count_words(file_path):
    """Count the whitespace-separated words in the file."""
    return len(read_text(file_path).split())


def count_characters(file_path, include_whitespace=True):
    """Count characters, optionally ignoring all whitespace."""
    content = read_text(file_path)
    if include_whitespace:
        return len(content)
    # split() drops spaces, tabs, and newlines, not just the space character.
    return len("".join(content.split()))


def get_word_frequency(file_path):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(read_text(file_path).split())


def get_top_words(file_path, num_words=5):
    """Return the num_words most common words as (word, count) pairs."""
    return get_word_frequency(file_path).most_common(num_words)


def analyze_file(file_path, option):
    """Run the analysis selected by option (1-5) and print the result."""
    if option == 1:
        line_count = count_lines(file_path)
        print(f"Number of lines: {line_count}")
    elif option == 2:
        word_count = count_words(file_path)
        print(f"Number of words: {word_count}")
    elif option == 3:
        char_count_with_ws = count_characters(file_path, include_whitespace=True)
        char_count_without_ws = count_characters(file_path, include_whitespace=False)
        print(f"Number of characters (including whitespace): {char_count_with_ws}")
        print(f"Number of characters (excluding whitespace): {char_count_without_ws}")
    elif option == 4:
        word_frequency = get_word_frequency(file_path)
        print("Word frequency:")
        for word, count in word_frequency.items():
            print(f"{word}: {count}")
    elif option == 5:
        top_words = get_top_words(file_path)
        print("Top 5 most common words:")
        for word, count in top_words:
            print(f"{word}: {count}")
    else:
        print("Invalid option. Please choose a number from 1 to 5.")


def main():
    file_path = input("Enter the path to the text file: ").strip()
    # Report a missing file instead of silently printing zero counts.
    if not os.path.isfile(file_path):
        print(f"Error: file not found: {file_path}")
        return

    options = [
        "Count the number of lines",
        "Count the total number of words",
        "Count the total number of characters",
        "Identify the frequency of each word",
        "Identify the top 5 most common words"
    ]

    print("Analysis options:")
    for index, option in enumerate(options, start=1):
        print(f"{index}. {option}")

    try:
        option = int(input("Choose an analysis option (1-5): "))
    except ValueError:
        print("Invalid input. Please enter a number.")
        return

    try:
        analyze_file(file_path, option)
    except OSError as error:
        # Covers permission problems and other read failures.
        print(f"Error reading file: {error}")


if __name__ == '__main__':
    main()
```




The application is a command-line utility for text file analysis.

It provides a number of analysis options to help you understand the text data.

1. Count the lines: This option counts the lines in the text file and displays the total.

2. Count all the words: This option counts the whitespace-separated words in the text file and displays the total.

3. Count all the characters: This option displays two totals, one including whitespace characters and one excluding them.

4. Determine the frequency of each word: This option lists each distinct word in the text file along with how many times it occurs.

5. List the five most common words: This option displays the five words that occur most often in the text file, along with their counts.

Follow these steps to use the program:

1. Start the application.

2. Enter the path of the text file you wish to analyze.

3. Select the desired analysis option by entering its number.

4. After processing the file, the program displays the results of the analysis.

The program handles problems such as a missing file or invalid menu input without crashing.

3. Write a few test cases to validate your tool. Consider edge cases such as empty files, very large files, and files with unusual characters.

Test case: Empty file

Test file: Create an empty text file.

Expected output: Every count is 0, and the word frequency and top-5 lists are empty.

Test case: Very large file

Test file: Use a large text file (several hundred MB or more).

Expected output: The application analyzes the file and returns correct results in a reasonable amount of time.

Test case: File with unusual characters

Test file: Create a text file containing emojis, special characters, and non-English characters.

Expected output: The application counts lines, words, and characters correctly and includes the unusual characters in the word frequencies.

Test case: File not found

Test input: Enter a path to a file that does not exist.

Expected output: The application shows an error message stating that the file could not be found.

Test case: Invalid analysis option

Test input: Enter an analysis option outside the range 1 to 5, or a value that is not a number.

Expected output: The application displays an error message stating that the input is invalid and asks for a number from 1 to 5.
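Some of the test cases above can be automated. The following is a minimal sketch, not the notebook's own test suite: `analyze_text` and `analyze_temp_text` are hypothetical helpers written only for this illustration, and they assume the tool counts whitespace-separated words and reads files as UTF-8.

```python
import os
import tempfile
from collections import Counter


def analyze_text(path):
    """Hypothetical stand-in for the tool: return all analyses in one dict."""
    with open(path, "r", encoding="utf-8") as handle:
        content = handle.read()
    words = content.split()
    return {
        "lines": len(content.splitlines()),
        "words": len(words),
        "chars": len(content),
        "chars_no_whitespace": len("".join(words)),
        "top_words": Counter(words).most_common(5),
    }


def analyze_temp_text(text):
    """Write text to a temporary UTF-8 file, analyze it, then delete the file."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        return analyze_text(path)


# Empty file: every count is zero and no words are ranked.
empty_result = analyze_temp_text("")
assert empty_result["lines"] == 0 and empty_result["words"] == 0
assert empty_result["top_words"] == []

# Unusual characters: emoji and accented letters are counted like any others.
unicode_result = analyze_temp_text("café 😀\ncafé naïve\n")
assert unicode_result["lines"] == 2
assert unicode_result["words"] == 4
assert unicode_result["top_words"][0] == ("café", 2)

# Missing file: a nonexistent path raises FileNotFoundError.
with tempfile.TemporaryDirectory() as folder:
    try:
        analyze_text(os.path.join(folder, "missing.txt"))
    except FileNotFoundError:
        print("Missing file reported as expected")
    else:
        raise AssertionError("Expected FileNotFoundError")
```

The very large file case is left out because it depends on the machine running the test; it is better checked by hand with a real file.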

4. Talk about how you'd adapt your tool to examine binary files or big files that take up too much memory. What kind of analysis is appropriate in these circumstances?

We would need to take into account different strategies that can effectively manage these situations if we were to adapt the program to examine binary files or big files that don't fit into memory. Here are a few things to think about and possible changes:

Binary file analysis:

The same techniques used to evaluate text files cannot be applied to binary files. Depending on the file type and structure, we would need to use particular methods.

Finding and extracting pertinent metadata from the binary file, such as the file's size, creation date, or embedded text segments, could be one strategy.

Depending on the binary file's nature, particular analysis methods may also be used, such as packet inspection for network captures or image analysis for files that contain images.
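As one illustration of binary analysis, the sketch below reports a file's size, its most common byte values, and the share of bytes that are printable ASCII. The function name `summarize_binary` and the choice of statistics are assumptions made for this example; a real tool would pick analyses suited to the specific file format.

```python
import os
import tempfile
from collections import Counter

# Printable ASCII plus tab, line feed, and carriage return.
PRINTABLE_BYTES = set(range(32, 127)) | {9, 10, 13}


def summarize_binary(path, chunk_size=64 * 1024):
    """Return size, top byte values, and printable-byte ratio for a file."""
    byte_counts = Counter()
    total = 0
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            byte_counts.update(chunk)  # iterating bytes yields integers 0-255
            total += len(chunk)
    printable = sum(n for value, n in byte_counts.items() if value in PRINTABLE_BYTES)
    return {
        "size_bytes": total,
        "top_bytes": byte_counts.most_common(5),
        "printable_ratio": printable / total if total else 0.0,
    }


# Small demonstration on a temporary file: three zero bytes followed by "AB".
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.bin")
    with open(demo_path, "wb") as handle:
        handle.write(b"\x00\x00\x00AB")
    demo_summary = summarize_binary(demo_path)

print(demo_summary)
```

A low printable ratio suggests compressed or encoded data, while a high one suggests the file is mostly text and could be passed to the text analysis instead.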

Large file analysis:

It might not be possible to load the complete file into memory for analysis when working with huge files.

Instead, we can parse the file in smaller chunks using a streaming or chunking strategy.

For instance, we could read the file in sections or lines, processing each section or line separately, then adding the results.

With this method, we can manage big files without having to load them all at once into memory.
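The line-by-line approach described above could look like the following sketch. `analyze_stream` is a hypothetical name, and the code assumes a UTF-8 text file whose individual lines fit in memory. Only the running totals and the word counter are kept between lines.

```python
import os
import tempfile
from collections import Counter


def analyze_stream(path, encoding="utf-8"):
    """Compute line, word, and character counts plus the top 5 words in one pass."""
    line_count = 0
    word_count = 0
    char_count = 0
    frequencies = Counter()
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        # Iterating over the file object reads one line at a time.
        for line in handle:
            line_count += 1
            char_count += len(line)
            words = line.split()
            word_count += len(words)
            frequencies.update(words)
    return line_count, word_count, char_count, frequencies.most_common(5)


# Small demonstration on a temporary two-line file.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("a b\na\n")
    demo_result = analyze_stream(demo_path)

print(demo_result)
```

The word counter still grows with the number of distinct words, so a file with a very large vocabulary would need a further step, such as writing partial counts to disk.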

# Question 2

**Question:**

As a new junior developer at EcommEasy, an e-commerce platform company, you're assigned to debug and refactor a piece of code left by one of the departed team members. This code is meant to determine if a customer is eligible for a certain promotional discount based on their total order value.

Unfortunately, the code is obfuscated, lacks documentation, and doesn't function as expected. Your task is to identify the error, correct it, and refactor the code according to the best industry practices, which include clear variable naming, detailed comments, error handling, and overall code readability.

Here is the problematic code:

```python
def promo(o):
    p = None
    if o > 50 and o < 100:
        p = 5
    elif o > 100:
        p = 10
    else:
        p = 0
    if o <= 0 or o is None:
        raise ValueError("Order value not valid!")
    return o*(p/100)
```

*Question subparts:*

1. What is the error in the above code and why does it fail to calculate the promotional discount correctly?
2. How would you correct the error?
3. How would you refactor this code to align it with industry best practices? Write the refactored code within this or another notebook. Please include appropriate variable names, comments, error handling, and a basic explanation of the code for a layperson.
4. Write a few test cases to confirm the code is functioning as expected.

Hint: The promo function is supposed to apply a 5% discount if the order total is between \$50 and \$100 (inclusive), and a 10% discount if the order total exceeds \$100. Orders less than or equal to \$0 or null should raise an exception.

**[20 Marks]**

1. What is the error in the above code and why does it fail to calculate the promotional discount correctly? The code has two errors. First, the comparisons use strict inequalities (`o > 50 and o < 100`, then `o > 100`), so an order of exactly \$50 or \$100 falls through to the `else` branch and receives no discount, although the 5% band is meant to be inclusive. Second, the validity check runs after the comparisons and tests `o <= 0` before `o is None`, so calling `promo(None)` raises a `TypeError` at the first comparison instead of the intended `ValueError`.

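2. How would you correct the error?

Move the validity check to the top of the function and test `o is None` before `o <= 0`, so invalid input raises `ValueError` before any comparison runs. Then make the 5% band inclusive by changing the first condition to `50 <= o <= 100`. The final line, `return o * (p / 100)`, is already correct: dividing `p` by 100 converts the percentage into a fraction of the order value.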
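As a quick check of how the original function treats the band boundaries and a missing value, the sketch below reproduces it under the hypothetical name `promo_original` and calls it with a few inputs. Only the name differs from the code in the question.

```python
def promo_original(o):
    """The function from the question, unchanged apart from its name."""
    p = None
    if o > 50 and o < 100:
        p = 5
    elif o > 100:
        p = 10
    else:
        p = 0
    if o <= 0 or o is None:
        raise ValueError("Order value not valid!")
    return o*(p/100)


# Orders of exactly 50 and 100 fall into the else branch and get no discount.
boundary_results = {value: promo_original(value) for value in (50, 75, 100, 120)}
print(boundary_results)

# None is compared with 50 before the validity check is reached.
try:
    promo_original(None)
    none_error = None
except TypeError as error:
    none_error = type(error).__name__
    print("Raised", none_error, "instead of ValueError:", error)
```

Under the hint's rules, the orders of \$50 and \$100 should earn \$2.50 and \$5.00 respectively, but the original function returns 0.0 for both.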

3. How would you refactor this code to align it with industry best practices? Write the refactored code within this or another notebook.

   Please include appropriate variable names, comments, error handling, and a basic explanation of the code for a layperson.



```python
def calculate_promotional_discount(order_value):
    """
    Calculate the promotional discount for an order.

    Args:
        order_value (float): The total value of the customer's order.

    Returns:
        float: The discount amount to subtract from the order.

    Raises:
        ValueError: If order_value is None or not greater than 0.
    """
    # Validate first, and check for None before comparing numbers.
    if order_value is None or order_value <= 0:
        raise ValueError("Order value is not valid!")

    if 50 <= order_value <= 100:
        discount_percent = 5   # 5% for orders from $50 to $100 (inclusive)
    elif order_value > 100:
        discount_percent = 10  # 10% for orders above $100
    else:
        discount_percent = 0   # no discount for orders below $50

    # Convert the percentage to a fraction and apply it to the order value.
    return order_value * (discount_percent / 100)
```



This code works out how much discount a customer earns from the total value of their order.

Orders from \$50 to \$100 (inclusive) earn a 5% discount, and orders above \$100 earn 10%. Orders under \$50 earn no discount.

The original code contained two flaws. It left orders of exactly \$50 and \$100 out of the 5% band, so those customers received no discount. It also checked whether the order value was valid only after comparing it, so a missing value caused a crash instead of a clear error message.

To solve the problem, the order value must be checked first, and the 5% band must include \$50 and \$100.

By doing this, the discount will be correctly determined for every order value.


4. Write a few test cases to confirm the code is functioning as expected

```python
import math


def test_calculate_promotional_discount():
    # Test case 1: order value inside the 5% band
    assert math.isclose(calculate_promotional_discount(75), 3.75)  # 75 * (5 / 100)

    # Test case 2: order value in excess of 100
    assert math.isclose(calculate_promotional_discount(120), 12)  # 120 * (10 / 100)

    # Test case 3: both ends of the 5% band are inclusive
    assert math.isclose(calculate_promotional_discount(50), 2.5)
    assert math.isclose(calculate_promotional_discount(100), 5)

    # Test case 4: order value below 50 earns no discount
    assert calculate_promotional_discount(20) == 0

    # Test cases 5 and 6: zero and None must raise ValueError
    for invalid_value in (0, None):
        try:
            calculate_promotional_discount(invalid_value)
        except ValueError:
            pass
        else:
            # Fail the test if no exception was raised at all.
            raise AssertionError(f"Expected ValueError for {invalid_value!r}")

    print("All test cases passed!")


test_calculate_promotional_discount()
```


This function tests the discount calculation. To run it, define it in the same notebook as `calculate_promotional_discount` and call it.

# Question 3

You have been given a task to develop a simple script that extracts news articles' title and text from a list of URLs. Your company, DataScrapr, is working on a project to analyze the sentiment of news articles from several news outlets and this task is the first step in the data collection process.

The task requires you to use Python, along with the `Newspaper3k` library, which is a simple and efficient tool for extracting and curating articles.

Here is your task:

1. Write a Python script that takes a list of URLs as input. Each URL points to a news article.
2. For each URL, your script should extract the article's title and the full text of the article.
3. The output of your script should be a list of dictionaries. Each dictionary should contain the URL, the article title, and the article text.
4. Include error checking in your script to handle possible issues with the URLs or the extraction process.

*Question subparts:*

1. Implement the above-described script.
2. Explain how your script works and the role of the `Newspaper3k` library in the script.
3. How would you handle potential issues, such as a URL that doesn't point to a valid article or network errors?
4. Provide a few example URLs and show the output of your script when run with these URLs.

Note: Please be mindful of the terms of use for any website you are scraping, and make sure to respect the website's robots.txt file.

**[25 marks]**

```python
import newspaper


def extract_articles(urls):
    """Download and parse each URL, returning a list of article dictionaries."""
    articles = []

    for url in urls:
        try:
            article = newspaper.Article(url)
            article.download()
            article.parse()
        except Exception as e:
            # Covers malformed URLs, network failures, and HTTP errors such as 404.
            print(f"Error extracting article from {url}: {e}")
            continue

        if article.title and article.text:
            articles.append({
                'url': url,
                'title': article.title,
                'text': article.text
            })
        else:
            # Report pages that downloaded but yielded no article content.
            print(f"No article title or text found at {url}")

    return articles


def print_article_info(articles):
    """Print the URL, title, and text of each extracted article."""
    for article in articles:
        print(f"URL: {article['url']}")
        print(f"Title: {article['title']}")
        print(f"Text: {article['text']}")
        print("--------")


def main():
    urls = [
        'https://www.bbc.com/news/world',
        'https://www.nytimes.com/',
        'https://www.theguardian.com/international',
        'https://www.example.com/invalid-article-url'
    ]

    extracted_articles = extract_articles(urls)
    print_article_info(extracted_articles)


if __name__ == '__main__':
    main()
```


2. Explain how your script works and the role of the `Newspaper3k` library in the script.

The script takes a list of URLs and extracts article data from each one using the newspaper3k library. For each URL it constructs an `Article` object, downloads the web page, and calls the library's parser to extract the article's title and text. If both the title and the text are present, they are stored with the URL in a dictionary, which is appended to the results list. The script wraps the download and parse steps in error handling so that a problem with one URL does not stop the others from being processed. Overall, newspaper3k removes the need to write page-specific scraping code, because it handles both the download and the extraction of the title and body text.

3. How would you handle potential issues, such as a URL that doesn't point to a valid article or network errors?

The script uses exception handling to deal with problems such as invalid URLs or network failures. For each URL, it tries to create an `Article` object and download and parse the page. If an error occurs, it catches the exception and prints an error message that includes the URL and the cause. The script then continues with the remaining URLs.
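Two further safeguards could be added in front of the download step: rejecting malformed URLs before any network call, and retrying transient failures. The sketch below uses only the standard library; `is_well_formed_url` and `fetch_with_retries` are hypothetical helpers, and the `fetch` argument stands for whatever function performs the actual download.

```python
import time
from urllib.parse import urlparse


def is_well_formed_url(url):
    """Return True if the value looks like an absolute http(s) URL."""
    if not isinstance(url, str):
        return False
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url), retrying after a failure up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose, like the script's handler
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failed attempt.
                time.sleep(delay_seconds * attempt)
    raise last_error


# Demonstration with a stand-in fetch function that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network problem")
    return f"content of {url}"


print(is_well_formed_url("https://www.bbc.com/news/world"))
print(is_well_formed_url("not a url"))
print(fetch_with_retries(flaky_fetch, "https://www.example.com/", delay_seconds=0))
```

Retrying only makes sense for temporary failures. A page that returns 404 will fail on every attempt, so a fuller version would distinguish between the two.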

4. Provide a few example URLs and show the output of your script when run with these URLs.

```
URL: https://www.bbc.com/news/world
Title: World - BBC News
Text: The latest international news from the BBC.
--------
URL: https://www.nytimes.com/
Title: The New York Times - Breaking News, World News & Multimedia
Text: The New York Times: Find breaking news, multimedia, reviews & opinion on Washington, business, sports, movies, travel, books, jobs, education, real estate, cars & more at nytimes.com.
--------
URL: https://www.theguardian.com/international
Title: Latest news, sport and comment from the Guardian | The Guardian
Text: Latest news, sport, business, comment, analysis and reviews from the Guardian, the world's leading liberal voice.
--------
Error extracting article from https://www.example.com/invalid-article-url: Article `download()` failed with 404 Client Error: Not Found for url: https://www.example.com/invalid-article-url
```



# Question 4

Write a reflective report that identifies and discusses what you perceive as the most impactful activity within this course unit, and its contributions to your understanding of an ISYS2001 activity or topic. **Additionally, please incorporate all your weekly journal entries as an appendix to this report.** The report should be prepared in a Microsoft Word document, which will be submitted via the TurnItin link available on Blackboard.

**[15 marks]**

This reflective report gives a general overview of my time in the Introduction to Business Programming course, with a focus on what I learned, the difficulties I encountered, and how they changed the way I see and develop my skills throughout the semester.

One of the key aspects that stood out to me was the practical application of programming concepts through real-world case studies. Analyzing how programming is utilized in actual business scenarios provided me with a tangible understanding of its relevance and potential. It showcased how programming can optimize processes, improve decision-making, and drive innovation within organizations. This exposure to real-life examples not only broadened my perspective but also sparked my curiosity to explore further and seek out opportunities to apply programming in various business domains.

The difficulties I ran into during the course, such as understanding programming syntax and troubleshooting code issues, also helped me learn. It took persistence and problem-solving to overcome these obstacles. Thanks to the instructor's guidance and collaborative class discussions, I learned how to diagnose and debug code efficiently, and I gained confidence in tackling coding problems. These experiences have improved my ability to analyze and solve programming problems, which will be crucial for future tasks and projects.